ML Modelos Supervisionados Clássicos

Projeto Final (2) - House Prices

Preparação Inicial

Definir diretório e pacotes necessários.

load.pks = c(
  "readr",
  "ggplot2",
  "plotly",
  "e1071",
  "dplyr",
  "tidyr",
  "nortest",
  "gridExtra",
  "kableExtra",
  "visdat",
  "stringr",
  "mlbench",
  "caret",
  "GGally",
  "RColorBrewer",
  "ROCR",
  "pROC",
  "rmarkdown",
  "qqplotr",
  "reshape2",
  "ggfortify",
  "car",
  "vip",
  "MASS"
)

lapply(load.pks, require, character.only = TRUE)

Entendimento do Negócio [Business Understanding]

Objetivo

O objetivo é prever o perço final de uma casa utilizando o banco de dados que contem uma lista de casas em Ames, Iowa. Esse banco de dados conta com 79 variáveis explicativas que ajudam a determinar o preço das casas.

Entendimento dos Dados [Data Understanding]

Descrição das Variáveis

train.csv - Registros de cerca de metade (~1460) das casas.

  • Id - Um ID exclusivo para cada casa.

  • MSSubClass - Identifica o tipo de habitação.

    • 20: 1-STORY 1946 & NEWER ALL STYLES
    • 30: 1-STORY 1945 & OLDER
    • 40: 1-STORY W/FINISHED ATTIC ALL AGES
    • 45: 1-1/2 STORY - UNFINISHED ALL AGES
    • 50: 1-1/2 STORY FINISHED ALL AGES
    • 60: 2-STORY 1946 & NEWER
    • 70: 2-STORY 1945 & OLDER
    • 75: 2-1/2 STORY ALL AGES
    • 80: SPLIT OR MULTI-LEVEL
    • 85: SPLIT FOYER
    • 90: DUPLEX - ALL STYLES AND AGES
    • 120: 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
    • 150: 1-1/2 STORY PUD - ALL AGES
    • 160: 2-STORY PUD - 1946 & NEWER
    • 180: PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
    • 190: 2 FAMILY CONVERSION - ALL STYLES AND AGES

  • MSZoning - Identifica a área em que está situado o imóvel.

    • A: Agriculture
    • C: Commercial
    • FV: Floating Village Residential
    • I: Industrial
    • RH: Residential High Density
    • RL: Residential Low Density
    • RP: Residential Low Density Park
    • RM: Residential Medium Density

  • LotFrontage - Distância em pés da área da frente da casa até a rua.

  • LotArea - Tamanho da casa em pés quadrados.

  • Street - Identifica o tipo de rua que dá acesso à propriedade.

    • Grvl: Gravel
    • Pave: Paved

  • Alley - Identifica o tipo de beco/ruela que dá acesso à propriedade.

    • Grvl: Gravel
    • Pave: Paved
    • NA: No alley access

  • LotShape - Identifica o formato do terreno da propriedade.

    • Reg: Regular
    • IR1: Slightly irregular
    • IR2: Moderately Irregular
    • IR3: Irregular

  • LandContour - Identifica o quao plano é o terreno da propriedade.

    • Lvl: Near Flat/Level
    • Bnk: Banked - Quick and significant rise from street grade to building
    • HLS: Hillside - Significant slope from side to side
    • Low: Depression

  • Utilities - Identifica o tipo de utilidades disponíveis.

    • AllPub: All public Utilities (E,G,W,& S)
    • NoSewr: Electricity, Gas, and Water (Septic Tank)
    • NoSeWa: Electricity and Gas Only
    • ELO: Electricity only
  • LotConfig - Identifica a configuração do terreno.

    • Inside: Inside lot
    • Corner: Corner lot
    • CulDSac: Cul-de-sac
    • FR2: Frontage on 2 sides of property
    • FR3: Frontage on 3 sides of property

  • LandSlope - Identifica o declive da propriedade.

    • Gtl: Gentle slope
    • Mod: Moderate Slope
    • Sev: Severe Slope

  • Neighborhood - Identifica as localidades físicas dentro dos limites da cidade de Ames.

    • Blmngtn: Bloomington Heights
    • Blueste: Bluestem
    • BrDale: Briardale
    • BrkSide: Brookside
    • ClearCr: Clear Creek
    • CollgCr: College Creek
    • Crawfor: Crawford
    • Edwards: Edwards
    • Gilbert: Gilbert
    • IDOTRR: Iowa DOT and Rail Road
    • MeadowV: Meadow Village
    • Mitchel: Mitchell
    • Names: North Ames
    • NoRidge: Northridge
    • NPkVill: Northpark Villa
    • NridgHt: Northridge Heights
    • NWAmes: Northwest Ames
    • OldTown: Old Town
    • SWISU: South & West of Iowa State University
    • Sawyer: Sawyer
    • SawyerW: Sawyer West
    • Somerst: Somerset
    • StoneBr: Stone Brook
    • Timber: Timberland
    • Veenker: Veenker

  • Condition1 - Identifica a proximidade a condições diversas.

    • Artery: Adjacent to arterial street
    • Feedr: Adjacent to feeder street
    • Norm: Normal
    • RRNn: Within 200’ of North-South Railroad
    • RRAn: Adjacent to North-South Railroad
    • PosN: Near positive off-site feature–park, greenbelt, etc.
    • PosA: Adjacent to postive off-site feature
    • RRNe: Within 200’ of East-West Railroad
    • RRAe: Adjacent to East-West Railroad

  • Condition2 - Identifica a proximidade a condições diversas (se mais de uma está presente).

    • Artery: Adjacent to arterial street
    • Feedr: Adjacent to feeder street
    • Norm: Normal
    • RRNn: Within 200’ of North-South Railroad
    • RRAn: Adjacent to North-South Railroad
    • PosN: Near positive off-site feature–park, greenbelt, etc.
    • PosA: Adjacent to postive off-site feature
    • RRNe: Within 200’ of East-West Railroad
    • RRAe: Adjacent to East-West Railroad

  • BldgType - Identfica o tipo de habitação.

    • 1Fam: Single-family Detached
    • 2FmCon: Two-family Conversion; originally built as one-family dwelling
    • Duplx: Duplex
    • TwnhsE: Townhouse End Unit
    • TwnhsI: Townhouse Inside Unit

  • HouseStyle - Identifica o estilo da habitação.

    • 1Story: One story
    • 1.5Fin: One and one-half story: 2nd level finished
    • 1.5Unf: One and one-half story: 2nd level unfinished
    • 2Story: Two story
    • 2.5Fin: Two and one-half story: 2nd level finished
    • 2.5Unf: Two and one-half story: 2nd level unfinished
    • SFoyer: Split Foyer
    • SLvl: Split Level

  • OverallQual - Avaliação geral do material e acabamento da casa.

    • 10: Very Excellent
    • 9: Excellent
    • 8: Very Good
    • 7: Good
    • 6: Above Average
    • 5: Average
    • 4: Below Average
    • 3: Fair
    • 2: Poor
    • 1: Very Poor

  • OverallCond - Avaliação geral da condição da casa.

    • 10: Very Excellent
    • 9: Excellent
    • 8: Very Good
    • 7: Good
    • 6: Above Average
    • 5: Average
    • 4: Below Average
    • 3: Fair
    • 2: Poor
    • 1: Very Poor

  • YearBuilt - Ano em que a casa foi contruída.

  • YearRemodAdd - Ano em que a casa foi remodelada (será igual a da construção se nenhuma adição ou remodelagem foi feita).

  • RoofStyle - Identifica o tipo de telhado.

    • Flat: Flat
    • Gable: Gable
    • Gambrel: Gabrel (Barn)
    • Hip: Hip
    • Mansard: Mansard
    • Shed: Shed

  • RoofMatl - Itentifica o tipo e material do telhado.

    • ClyTile: Clay or Tile
    • CompShg: Standard (Composite) Shingle
    • Membran: Membrane
    • Metal: Metal
    • Roll: Roll
    • Tar&Grv: Gravel & Tar
    • WdShake: Wood Shakes
    • WdShngl: Wood Shingles

  • Exterior1st - Identifica o acabamento externo da casa.

    • AsbShng: Asbestos Shingles
    • AsphShn: Asphalt Shingles
    • BrkComm: Brick Common
    • BrkFace: Brick Face
    • CBlock: Cinder Block
    • CemntBd: Cement Board
    • HdBoard: Hard Board
    • ImStucc: Imitation Stucco
    • MetalSd: Metal Siding
    • Other: Other
    • Plywood: Plywood
    • PreCast: PreCast
    • Stone: Stone
    • Stucco: Stucco
    • VinylSd: Vinyl Siding
    • Wd Sdng: Wood Siding
    • WdShing: Wood Shingles

  • Exterior2nd - Identifica o acabamento externo da casa (se exixtir mais de um).

    • AsbShng: Asbestos Shingles
    • AsphShn: Asphalt Shingles
    • BrkComm: Brick Common
    • BrkFace: Brick Face
    • CBlock: Cinder Block
    • CemntBd: Cement Board
    • HdBoard: Hard Board
    • ImStucc: Imitation Stucco
    • MetalSd: Metal Siding
    • Other: Other
    • Plywood: Plywood
    • PreCast: PreCast
    • Stone: Stone
    • Stucco: Stucco
    • VinylSd: Vinyl Siding
    • Wd Sdng: Wood Siding
    • WdShing: Wood Shingles

  • MasVnrType - Identifica o tipo de alvenaria.

    • BrkCmn: Brick Common
    • BrkFace: Brick Face
    • CBlock: Cinder Block
    • None: None
    • Stone: Stone

  • MasVnrArea - Identifica a área em pés quadrados da alvenaria.

  • ExterQual - Avaliação da qualidade do material do exterior da construção.

    • Ex: Excellent
    • Gd: Good
    • TA: Average/Typical
    • Fa: Fair
    • Po: Poor

  • ExterCond - Avaliação da condição atual do material do exterior da contrução.

    • Ex: Excellent
    • Gd: Good
    • TA: Average/Typical
    • Fa: Fair
    • Po: Poor

  • Foundation - Identifica o tipo de fundação.

    • BrkTil: Brick & Tile
    • CBlock: Cinder Block
    • PConc: Poured Contrete
    • Slab: Slab
    • Stone: Stone
    • Wood: Wood

  • BsmtQual - Avaliação da altura do porão.

    • Ex: Excellent (100+ inches)
    • Gd: Good (90-99 inches)
    • TA: Typical (80-89 inches)
    • Fa: Fair (70-79 inches)
    • Po: Poor (<70 inches
    • NA: No Basement

  • BsmtCond - Avaliação da condição geral do porão.

    • Ex: Excellent
    • Gd: Good
    • TA: Typical - slight dampness allowed
    • Fa: Fair - dampness or some cracking or settling
    • Po: Poor - Severe cracking, settling, or wetness
    • NA: No Basement

  • BsmtExposure - Identifica o tipo de exposição do porão para o jardim.

    • Gd: Good Exposure
    • Av: Average Exposure (split levels or foyers typically score average or above)
    • Mn: Mimimum Exposure
    • No: No Exposure
    • NA: No Basement

  • BsmtFinType1 - Classificação da área contruída tipo 1 do porão.

    • GLQ: Good Living Quarters
    • ALQ: Average Living Quarters
    • BLQ: Below Average Living Quarters
    • Rec: Average Rec Room
    • LwQ: Low Quality
    • Unf: Unfinshed
    • NA: No Basement

  • BsmtFinSF1 - Área contruída tipo 1 do porão em pés quadrados.

  • BsmtFinType2 - Classificação da área contruída tipo 2 do porão (se mais de um).

    • GLQ: Good Living Quarters
    • ALQ: Average Living Quarters
    • BLQ: Below Average Living Quarters
    • Rec: Average Rec Room
    • LwQ: Low Quality
    • Unf: Unfinshed
    • NA: No Basement

  • BsmtFinSF2 - Área contruída tipo 1 do porão em pés quadrados.

  • BsmtUnfSF - Área não contruída do porão em pés quadrados.

  • TotalBsmtSF - Área total do porão em pés quadrados.

  • Heating - Identifica o tipo de sistema de aquecimento.

    • Floor: Floor Furnace
    • GasA: Gas forced warm air furnace
    • GasW: Gas hot water or steam heat
    • Grav: Gravity furnace
    • OthW: Hot water or steam heat other than gas
    • Wall: Wall furnace

  • HeatingQC - Avaliação da qualidade e condição dos sistema de aquecimento.

    • Ex: Excellent
    • Gd: Good
    • TA: Average/Typical
    • Fa: Fair
    • Po: Poor

  • CentralAir - Identifica a existência de sistema de ar condicionado.

    • N: No
    • Y: Yes

  • Electrical - Identifica o tipo de sistema elétrico.

    • SBrkr: Standard Circuit Breakers & Romex
    • FuseA: Fuse Box over 60 AMP and all Romex wiring (Average)
    • FuseF: 60 AMP Fuse Box and mostly Romex wiring (Fair)
    • FuseP: 60 AMP Fuse Box and mostly knob & tube wiring (poor)
    • Mix: Mixed

  • 1stFlrSF - Área do primeiro andar em pés quadrados.

  • 2ndFlrSF - Área do segundo andar em pés quadrados.

  • LowQualFinSF - Área com acabamento de baixa qualidade em pés quadrados (todos os andares).

  • GrLivArea - Área habitável acima do solo em pés quadrados.

  • BsmtFullBath - Quantidade de banheiros completos no porão.

  • BsmtHalfBath - Quantidade de lavabos no porão.

  • FullBath - Quantidade de banheiros completos na área habitável acima do solo.

  • HalfBath - Quantidade de lavabos na área habitável acima do solo.

  • Bedroom - Quantidade de quartos na área habitável acima do solo (Não inclui o porão).

  • Kitchen - Quantidade de cozinhas na área habitável acima do solo.

  • KitchenQual - Avaliação da qualidade da cozinha.

    • Ex: Excellent
    • Gd: Good
    • TA: Typical/Average
    • Fa: Fair
    • Po: Poor

  • TotRmsAbvGrd - Quantidade total de cômodoss na área habitável acima do solo (Não inclui banheiros).

  • Functional - Idintifica a funcionaidade da propriedade (assume como “Típica” a menos que haja alguma dedução).

    • Typ: Typical Functionality
    • Min1: Minor Deductions 1
    • Min2: Minor Deductions 2
    • Mod: Moderate Deductions
    • Maj1: Major Deductions 1
    • Maj2: Major Deductions 2
    • Sev: Severely Damaged
    • Sal: Salvage only

  • Fireplaces - Quantidade de lareiras.

  • FireplaceQu - Avaliação da qualidade da lareira.

    • Ex: Excellent - Exceptional Masonry Fireplace
    • Gd: Good - Masonry Fireplace in main level
    • TA: Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
    • Fa: Fair - Prefabricated Fireplace in basement
    • Po: Poor - Ben Franklin Stove
    • NA: No Fireplace

  • GarageType - Identifica a localização da garagem.

    • 2Types: More than one type of garage
    • Attchd: Attached to home
    • Basment: Basement Garage
    • BuiltIn: Built-In (Garage part of house - typically has room above garage)
    • CarPort: Car Port
    • Detchd: Detached from home
    • NA: No Garage

  • GarageYrBlt - Ano de contrução da garagem.

  • GarageFinish - Identifica o acabamento interno da garagem.

    • Fin: Finished
    • RFn: Rough Finished
    • Unf: Unfinished
    • NA: No Garage

  • GarageCars - Quantidade de carros que a garagem comporta.

  • GarageArea - Tamanho da garagem em pés quadrados.

  • GarageQual - Identifica a qualidade da garagem.

    • Ex: Excellent
    • Gd: Good
    • TA: Typical/Average
    • Fa: Fair
    • Po: Poor
    • NA: No Garage

  • GarageCond - Identifica a condição da garagem.

    • Ex: Excellent
    • Gd: Good
    • TA: Typical/Average
    • Fa: Fair
    • Po: Poor
    • NA: No Garage

  • PavedDrive - Identifica se a entrada da gragem é pavimentada.

    • Y: Paved
    • P: Partial Pavement
    • N: Dirt/Gravel

  • WoodDeckSF - Área do deck de madeira em pés quadrados.

  • OpenPorchSF - Área da varanda em pés quadrados.

  • EnclosedPorch - Área da varanda fechada (telhado) em pés quadrados.

  • 3SsnPorch - Área da varanda fechada (3 seasons) em pés quadrados.

  • ScreenPorch - Área da varanda fechada (tela) em pés quadrados.

  • PoolArea - Área da da piscina em pés quadrados.

  • PoolQC - Identifica a qualidade da piscina.

    • Ex: Excellent
    • Gd: Good
    • TA: Average/Typical
    • Fa: Fair
    • NA: No Pool

  • Fence - Identifica a qualidade da cerca.

    • GdPrv: Good Privacy
    • MnPrv: Minimum Privacy
    • GdWo: Good Wood
    • MnWw: Minimum Wood/Wire
    • NA: No Fence

  • MiscFeature - Identifica coisas variadas que não foram cobertas nas categorias anteriores.

    • Elev: Elevator
    • Gar2: 2nd Garage (if not described in garage section)
    • Othr: Other
    • Shed: Shed (over 100 SF)
    • TenC: Tennis Court
    • NA: None

  • MiscVal - Valor em dólares das coisas variádas descritas acima.

  • MoSold - Mês da venda (MM).

  • YrSold - Ano da venda (YYYY).

  • SaleType - Identifica o tipo da venda.

    • WD: Warranty Deed - Conventional
    • CWD: Warranty Deed - Cash
    • VWD: Warranty Deed - VA Loan
    • New: Home just constructed and sold
    • COD: Court Officer Deed/Estate
    • Con: Contract 15% Down payment regular terms
    • ConLw: Contract Low Down payment and low interest
    • ConLI: Contract Low Interest
    • ConLD: Contract Low Down
    • Oth: Other

  • SaleCondition - Condição da venda.

    • Normal: Normal Sale
    • Abnorml: Abnormal Sale - trade, foreclosure, short sale
    • AdjLand: Adjoining Land Purchase
    • Alloca: Allocation - two linked properties with separate deeds, typically condo with a garage unit
    • Family: Sale between family members
    • Partial: Home was not completed when last assessed (associated with New Homes)

  • SalesPrice - O preço de venda da casa. Este é o target, a coluna que você está tentando prever.
    test.csv - Registros de cerca de metade (~1459) das casas, para serem usados como dados de teste.
    Sua tarefa é prever o valor de Transportado para os passageiros neste conjunto.

    sample_submission.csv - Um arquivo de envio no formato correto.

  • Id - Id para cada casa no conjunto de teste.

  • SalesPrice - O alvo. Para cada casa, preveja o valor da venda associado a ela.


Preparação dos Dados [Data Preparation]

Importação dos Dados

Vamos importar os 3 bancos: Train, Test e Sample_Submission.

ds2_train <- read_csv(".\\house-prices-regression\\train.csv")
ds2_test <- read_csv(".\\house-prices-regression\\test.csv")
ds2_submission <- read_csv(".\\house-prices-regression\\sample_submission.csv")

Antes de ajustar as variáveis, vamos combinar os dados para que as mudanças sejam feitas de forma mais rápida.

#merge between test & submission
ds2_test_submission <- merge(ds2_test, ds2_submission, by=c("Id", "Id"))

ds2_train$table <- "train"
ds2_test_submission$table <- "test"

ds2_all <- rbind(ds2_train, ds2_test_submission)

Ajuste das Variáveis

Esse banco de dados consiste basicamente em variáveis qualitativas em sua grande maioria. Ainda sim algumas mudanças precisam ser feitas para que possamos realizar a modelagem.
As variáveis OverallQal e OverallCond são Escalas de Likert pura e simplemente com 10 níveis de classificação. Existem outras variáveis no banco como ExterQual, ExterCond, BsmtQual, BsmtCond, HeatingQC, KitchenQual, FireplaceQu, GarageQual, GarageCond e PoolQC também são escalas de Likert contudo temos letras ao invés de números e essas letras também representam uma ordem. Vamos fazer a modificação dessas variáveis, principalmente para o caso em que uma das possíveis respostas para elas é “NA”, indicando que essa pergunta não se aplica naquela moradia e não deve ser removido como se fosse valor faltante.
Variáveis relativas ao porão por exemplo, só serão respondidas se a casa possui porão, então faz sentido criar uma variável dummy respondendo essa pergunta simples pois pode ser mais significante no modelo saber se tem ou não porão. O mesmo para a variável sobre a qualidade da lareira, quantidade de garagens, condição da piscina, condição da cerca e a variável MiscFeature.

Snapshot:

head(ds2_all) %>%  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice table
1 60 RL 65 8450 Pave NA Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NA Attchd 2003 RFn 2 548 TA TA Y 0 61 0 0 0 0 NA NA NA 0 2 2008 WD Normal 208500 train
2 20 RL 80 9600 Pave NA Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976 RFn 2 460 TA TA Y 298 0 0 0 0 0 NA NA NA 0 5 2007 WD Normal 181500 train
3 60 RL 68 11250 Pave NA IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001 RFn 2 608 TA TA Y 0 42 0 0 0 0 NA NA NA 0 9 2008 WD Normal 223500 train
4 70 RL 60 9550 Pave NA IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998 Unf 3 642 TA TA Y 0 35 272 0 0 0 NA NA NA 0 2 2006 WD Abnorml 140000 train
5 60 RL 84 14260 Pave NA IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000 RFn 3 836 TA TA Y 192 84 0 0 0 0 NA NA NA 0 12 2008 WD Normal 250000 train
6 50 RL 85 14115 Pave NA IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin 5 5 1993 1995 Gable CompShg VinylSd VinylSd None 0 TA TA Wood Gd TA No GLQ 732 Unf 0 64 796 GasA Ex Y SBrkr 796 566 0 1362 1 0 1 1 1 1 TA 5 Typ 0 NA Attchd 1993 Unf 2 480 TA TA Y 40 30 0 320 0 0 NA MnPrv Shed 700 10 2009 WD Normal 143000 train

Observações faltantes

#h_train1_summary0 <- as.data.frame(which( colSums( is.na(ds2_all[ds2_all$table == "train",]) ) > 0 ))
#colnames(h_train1_summary3) <- c("Missing Observations")


h_train1_summary0 <- as.data.frame(t(ds2_all[ds2_all$table == "train",] %>% 
  summarise(across(everything(), ~ sum(is.na(.x)) ))))
colnames(h_train1_summary0) <- c("Missing Observations")

h_train1_summary0 %>% 
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T) %>%
  column_spec(2, color = "black",
              background = ifelse(h_train1_summary0 > 0, "yellow", "blank")
              #,
             #popover = paste("am:", mtcars$am[1:8])
              )
Missing Observations
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
Street 0
Alley 1369
LotShape 0
LandContour 0
Utilities 0
LotConfig 0
LandSlope 0
Neighborhood 0
Condition1 0
Condition2 0
BldgType 0
HouseStyle 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
RoofStyle 0
RoofMatl 0
Exterior1st 0
Exterior2nd 0
MasVnrType 8
MasVnrArea 8
ExterQual 0
ExterCond 0
Foundation 0
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinSF1 0
BsmtFinType2 38
BsmtFinSF2 0
BsmtUnfSF 0
TotalBsmtSF 0
Heating 0
HeatingQC 0
CentralAir 0
Electrical 1
1stFlrSF 0
2ndFlrSF 0
LowQualFinSF 0
GrLivArea 0
BsmtFullBath 0
BsmtHalfBath 0
FullBath 0
HalfBath 0
BedroomAbvGr 0
KitchenAbvGr 0
KitchenQual 0
TotRmsAbvGrd 0
Functional 0
Fireplaces 0
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageCars 0
GarageArea 0
GarageQual 81
GarageCond 81
PavedDrive 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
3SsnPorch 0
ScreenPorch 0
PoolArea 0
PoolQC 1453
Fence 1179
MiscFeature 1406
MiscVal 0
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
table 0

Porão

Note que as variáveis que são relacionadas com o Porão possuem quase a mesma quantidade de observações faltantes, idicando que possivelmente se a casa não tem porão, as depois variáveis relacioandas a essa parte da casa fazem sentido serem nulas. Contudo, não temos uma variável na base que indica claramente se a casa possui ou não porão e para fins desse modelo, seria conveniente ter uma variável binária dizendo se a casa tem ou não porão.
Vamos então analisar os valores faltantes dessa variável e ver se de fato todas os faltantes são casas que não possuem porão.

ds2_all[ds2_all$table == "train",] %>% 
  dplyr::select("Id" | contains("Bsm")) %>%
  filter_all(any_vars( is.na(.)))  %>%  
    kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Id BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
18 NA NA NA NA 0 NA 0 0 0 0 0
40 NA NA NA NA 0 NA 0 0 0 0 0
91 NA NA NA NA 0 NA 0 0 0 0 0
103 NA NA NA NA 0 NA 0 0 0 0 0
157 NA NA NA NA 0 NA 0 0 0 0 0
183 NA NA NA NA 0 NA 0 0 0 0 0
260 NA NA NA NA 0 NA 0 0 0 0 0
333 Gd TA No GLQ 1124 NA 479 1603 3206 1 0
343 NA NA NA NA 0 NA 0 0 0 0 0
363 NA NA NA NA 0 NA 0 0 0 0 0
372 NA NA NA NA 0 NA 0 0 0 0 0
393 NA NA NA NA 0 NA 0 0 0 0 0
521 NA NA NA NA 0 NA 0 0 0 0 0
533 NA NA NA NA 0 NA 0 0 0 0 0
534 NA NA NA NA 0 NA 0 0 0 0 0
554 NA NA NA NA 0 NA 0 0 0 0 0
647 NA NA NA NA 0 NA 0 0 0 0 0
706 NA NA NA NA 0 NA 0 0 0 0 0
737 NA NA NA NA 0 NA 0 0 0 0 0
750 NA NA NA NA 0 NA 0 0 0 0 0
779 NA NA NA NA 0 NA 0 0 0 0 0
869 NA NA NA NA 0 NA 0 0 0 0 0
895 NA NA NA NA 0 NA 0 0 0 0 0
898 NA NA NA NA 0 NA 0 0 0 0 0
949 Gd TA NA Unf 0 Unf 0 936 936 0 0
985 NA NA NA NA 0 NA 0 0 0 0 0
1001 NA NA NA NA 0 NA 0 0 0 0 0
1012 NA NA NA NA 0 NA 0 0 0 0 0
1036 NA NA NA NA 0 NA 0 0 0 0 0
1046 NA NA NA NA 0 NA 0 0 0 0 0
1049 NA NA NA NA 0 NA 0 0 0 0 0
1050 NA NA NA NA 0 NA 0 0 0 0 0
1091 NA NA NA NA 0 NA 0 0 0 0 0
1180 NA NA NA NA 0 NA 0 0 0 0 0
1217 NA NA NA NA 0 NA 0 0 0 0 0
1219 NA NA NA NA 0 NA 0 0 0 0 0
1233 NA NA NA NA 0 NA 0 0 0 0 0
1322 NA NA NA NA 0 NA 0 0 0 0 0
1413 NA NA NA NA 0 NA 0 0 0 0 0

Repare que a casa de ID = 333 tem uma variável faltante no campo BsmtFinType2, o que deveria indicar que a casa não tem porão, contude a variável BsmtFinType1 foi preenchida e também a variável que se relaciona com a metragem contruida no tipo 2, BsmtFinSF2, indicando que de fato, a observação 333 tem um valor missing e que não sabemos qual a classificação dada para essa parte da casa.
A casa de ID = 949 parece mostrar que a casa possui um porão, mas não sabemos mais detalhes sobre o mesmo, parece não foi terminado e somente temos a metragem final. Nesse caso, ter BsmtExposure = NA significaria que a casa não possui porão o que não parenta ser verdade. Isso indica que 949 possui um valor faltante real. O mesmo ocorre com a casa de ID = 1488.
Já a casa de ID = 2041 possui um valor faltante na variável BsmtCond, indicando que esse também é um valor faltante real. O mesmo ocorre para as casa de IDs 2186 e 2525.
As casas com ID = 2218 e 2219 possuem valores faltantes reais na variável BsmtQual.
A casa com ID = 2349 possue um valor faltante real para a variável BsmtExposure.
Ou seja, a maioria dos valores faltantes se referem à casas que não possuem porão, com exceção dos descritos acima, isso deve ser considerado quando formos criar uma variável binária indicativa de ter ou não porão na casa.

Para fins de estudo, decidimos que se:
- BsmtQual = NA - BsmtCond = NA - BsmtExposure = NA - BsmtFinType1 = NA - BsmtFinSF1 = 0 - BsmtFinType2 = NA - BsmtFinSF2 = 0 - BsmtUnfSF = 0 - TotalBsmtSF = 0
então temos que a casa não possui um porão, caso contrário vamos dizer que possui. Somente com essa reclassificação já poderemos tirar os valores reais faltantes do que não são reais.

Alvenaria

Similarmente, temos para se a casa possui ou não alvernaria:
- MasVnrType = None - MasVnrArea = 0
ambas as condições tem que ocorrer.

ds2_all[ds2_all$table == "train",] %>% 
  dplyr::select("Id" | contains("MasVnr")) %>%
  filter_all(any_vars(grepl("None",.)))  %>%  
    kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Id MasVnrType MasVnrArea
2 None 0
4 None 0
6 None 0
9 None 0
10 None 0
11 None 0
13 None 0
16 None 0
18 None 0
19 None 0
20 None 0
22 None 0
24 None 0
25 None 0
27 None 0
29 None 0
30 None 0
31 None 0
32 None 0
33 None 0
34 None 0
37 None 0
39 None 0
40 None 0
42 None 0
43 None 0
44 None 0
45 None 0
47 None 0
48 None 0
49 None 0
50 None 0
51 None 0
52 None 0
53 None 0
54 None 0
55 None 0
58 None 0
60 None 0
61 None 0
62 None 0
64 None 0
69 None 0
70 None 0
72 None 0
75 None 0
76 None 0
77 None 0
78 None 0
79 None 0
80 None 0
87 None 0
89 None 0
90 None 0
91 None 0
93 None 0
94 None 0
95 None 0
99 None 0
100 None 0
103 None 0
104 None 0
107 None 0
108 None 0
109 None 0
111 None 0
112 None 0
115 None 0
118 None 0
119 None 0
120 None 0
121 None 0
122 None 0
124 None 0
125 None 0
126 None 0
127 None 0
128 None 0
133 None 0
135 None 0
140 None 0
141 None 0
142 None 0
143 None 0
147 None 0
150 None 0
151 None 0
154 None 0
155 None 0
156 None 0
157 None 0
159 None 0
161 None 0
164 None 0
165 None 0
166 None 0
167 None 0
169 None 0
171 None 0
173 None 0
176 None 0
178 None 0
180 None 0
182 None 0
184 None 0
185 None 0
186 None 0
187 None 0
188 None 0
190 None 0
191 None 0
193 None 0
195 None 0
196 None 0
198 None 0
199 None 0
201 None 0
202 None 0
203 None 0
205 None 0
206 None 0
207 None 0
208 None 0
211 None 0
212 None 0
213 None 0
214 None 0
218 None 0
221 None 0
222 None 0
223 None 0
224 None 0
229 None 0
234 None 0
238 None 0
240 None 0
242 None 0
243 None 0
244 None 0
245 None 0
247 None 0
248 None 0
251 None 0
253 None 0
255 None 0
257 None 0
260 None 0
264 None 0
265 None 0
267 None 0
268 None 0
269 None 0
271 None 0
272 None 0
275 None 0
276 None 0
277 None 0
278 None 0
285 None 0
286 None 0
287 None 0
288 None 0
290 None 0
291 None 0
292 None 0
293 None 0
296 None 0
297 None 0
300 None 0
302 None 0
304 None 0
305 None 0
308 None 0
309 None 0
310 None 0
312 None 0
313 None 0
314 None 0
315 None 0
316 None 0
318 None 0
324 None 0
326 None 0
327 None 0
329 None 0
330 None 0
332 None 0
335 None 0
336 None 0
339 None 0
341 None 0
342 None 0
345 None 0
346 None 0
347 None 0
352 None 0
353 None 0
354 None 0
355 None 0
356 None 0
358 None 0
361 None 0
362 None 0
366 None 0
370 None 0
371 None 0
372 None 0
373 None 0
374 None 0
375 None 0
376 None 0
377 None 0
378 None 0
381 None 0
382 None 0
383 None 0
384 None 0
385 None 0
387 None 0
391 None 0
392 None 0
393 None 0
394 None 0
395 None 0
396 None 0
397 None 0
399 None 0
401 None 0
402 None 0
403 None 0
405 None 0
406 None 0
407 None 0
408 None 0
411 None 0
412 None 0
414 None 0
415 None 0
416 None 0
418 None 0
419 None 0
420 None 0
423 None 0
426 None 0
428 None 0
430 None 0
432 None 0
434 None 0
435 None 0
436 None 0
437 None 0
438 None 0
439 None 0
440 None 0
443 None 0
445 None 0
446 None 0
447 None 0
448 None 0
449 None 0
450 None 0
451 None 0
452 None 0
454 None 0
455 None 0
457 None 0
458 None 0
459 None 0
462 None 0
463 None 0
464 None 0
467 None 0
471 None 0
472 None 0
475 None 0
476 None 0
483 None 0
485 None 0
486 None 0
488 None 0
489 None 0
490 None 0
491 None 0
492 None 0
493 None 0
494 None 0
495 None 0
496 None 0
497 None 0
498 None 0
500 None 0
502 None 0
503 None 0
504 None 0
505 None 0
508 None 0
509 None 0
513 None 0
514 None 0
515 None 0
519 None 0
520 None 0
521 None 0
522 None 0
523 None 0
526 None 0
527 None 0
529 None 0
532 None 0
533 None 0
534 None 0
535 None 0
536 None 0
537 None 0
538 None 0
543 None 0
545 None 0
547 None 0
548 None 0
549 None 0
550 None 0
551 None 0
552 None 0
554 None 0
556 None 0
557 None 0
558 None 0
559 None 0
562 None 0
563 None 0
564 None 0
565 None 0
566 None 0
569 None 0
570 None 0
573 None 0
574 None 0
576 None 0
577 None 0
579 None 0
584 None 0
585 None 0
587 None 0
588 None 0
589 None 0
590 None 0
591 None 0
593 None 0
595 None 0
597 None 0
600 None 0
602 None 0
603 None 0
604 None 0
607 None 0
608 None 0
609 None 0
614 None 0
615 None 0
617 None 0
618 None 0
621 None 0
622 None 0
625 None 288
627 None 0
631 None 0
634 None 0
635 None 0
636 None 0
637 None 0
638 None 0
639 None 0
640 None 0
642 None 0
643 None 0
646 None 0
647 None 0
650 None 0
652 None 0
653 None 0
654 None 0
658 None 0
659 None 0
660 None 0
664 None 0
665 None 0
669 None 0
670 None 0
671 None 0
672 None 0
673 None 0
675 None 0
676 None 0
677 None 0
678 None 0
681 None 0
682 None 0
683 None 0
686 None 0
687 None 0
688 None 0
694 None 0
695 None 0
697 None 0
698 None 0
699 None 0
700 None 0
704 None 0
706 None 0
709 None 0
711 None 0
712 None 0
713 None 0
717 None 0
718 None 0
720 None 0
721 None 0
723 None 0
724 None 0
726 None 0
730 None 0
731 None 0
734 None 0
735 None 0
736 None 0
737 None 0
738 None 0
739 None 0
740 None 0
741 None 0
742 None 0
744 None 0
745 None 0
747 None 0
748 None 0
750 None 0
751 None 0
752 None 0
753 None 0
755 None 0
757 None 0
761 None 0
762 None 0
763 None 0
765 None 0
768 None 0
769 None 0
771 None 0
772 None 0
773 None 0
774 None 1
778 None 0
782 None 0
783 None 0
785 None 0
786 None 0
787 None 0
788 None 0
789 None 0
792 None 0
793 None 0
795 None 0
798 None 0
801 None 0
802 None 0
803 None 0
805 None 0
807 None 0
808 None 0
810 None 0
813 None 0
815 None 0
817 None 0
821 None 0
822 None 0
823 None 0
824 None 0
825 None 0
827 None 0
828 None 0
829 None 0
830 None 0
832 None 0
836 None 0
837 None 0
839 None 0
840 None 0
841 None 0
842 None 0
844 None 0
845 None 0
846 None 0
848 None 0
849 None 0
853 None 0
855 None 0
856 None 0
857 None 0
858 None 0
861 None 0
862 None 0
863 None 0
864 None 0
865 None 0
868 None 0
869 None 0
871 None 0
873 None 0
874 None 0
875 None 0
876 None 0
877 None 0
879 None 0
881 None 0
883 None 0
884 None 0
888 None 0
889 None 0
891 None 0
893 None 0
894 None 0
895 None 0
897 None 0
898 None 0
900 None 0
901 None 0
902 None 0
903 None 0
908 None 0
909 None 0
910 None 0
912 None 0
913 None 0
914 None 0
917 None 0
918 None 0
919 None 0
920 None 0
922 None 0
923 None 0
924 None 0
926 None 0
929 None 0
931 None 0
932 None 0
935 None 0
938 None 0
939 None 0
940 None 0
941 None 0
943 None 0
944 None 0
946 None 0
950 None 0
951 None 0
952 None 0
953 None 0
957 None 0
958 None 0
959 None 0
960 None 0
961 None 0
963 None 0
964 None 0
966 None 0
967 None 0
969 None 0
971 None 0
973 None 0
975 None 0
976 None 0
977 None 0
979 None 0
980 None 0
984 None 0
985 None 0
986 None 0
987 None 0
990 None 0
992 None 0
994 None 0
996 None 0
997 None 0
999 None 0
1001 None 0
1002 None 0
1007 None 0
1008 None 0
1010 None 0
1011 None 0
1012 None 0
1013 None 0
1014 None 0
1016 None 0
1018 None 0
1021 None 0
1023 None 0
1025 None 0
1026 None 0
1029 None 0
1031 None 0
1032 None 0
1033 None 0
1035 None 0
1036 None 0
1039 None 0
1040 None 0
1045 None 0
1046 None 0
1048 None 0
1050 None 0
1051 None 0
1052 None 0
1058 None 0
1061 None 0
1062 None 0
1063 None 0
1064 None 0
1066 None 0
1067 None 0
1069 None 0
1070 None 0
1072 None 0
1073 None 0
1075 None 0
1076 None 0
1077 None 0
1078 None 0
1080 None 0
1082 None 0
1085 None 0
1086 None 0
1087 None 0
1088 None 0
1090 None 0
1091 None 0
1093 None 0
1095 None 0
1096 None 0
1097 None 0
1098 None 0
1099 None 0
1101 None 0
1102 None 0
1104 None 0
1107 None 0
1108 None 0
1109 None 0
1111 None 0
1113 None 0
1114 None 0
1115 None 0
1117 None 0
1118 None 0
1119 None 0
1120 None 0
1121 None 0
1123 None 0
1124 None 0
1126 None 0
1128 None 0
1129 None 0
1131 None 0
1132 None 0
1133 None 0
1134 None 0
1135 None 0
1136 None 0
1137 None 0
1138 None 0
1139 None 0
1140 None 0
1141 None 0
1144 None 0
1145 None 0
1146 None 0
1148 None 0
1149 None 0
1150 None 0
1151 None 0
1153 None 0
1154 None 0
1156 None 0
1157 None 0
1161 None 0
1163 None 0
1164 None 0
1165 None 0
1168 None 0
1169 None 0
1171 None 0
1172 None 0
1173 None 0
1174 None 0
1175 None 0
1177 None 0
1178 None 0
1179 None 0
1180 None 0
1181 None 0
1183 None 0
1184 None 0
1186 None 0
1187 None 0
1188 None 0
1190 None 0
1193 None 0
1196 None 0
1197 None 0
1198 None 0
1199 None 0
1201 None 0
1202 None 0
1203 None 0
1205 None 0
1207 None 0
1209 None 0
1211 None 0
1212 None 0
1213 None 0
1214 None 0
1215 None 0
1216 None 0
1217 None 0
1219 None 0
1221 None 0
1222 None 0
1225 None 0
1230 None 0
1231 None 1
1233 None 0
1235 None 0
1236 None 0
1238 None 0
1239 None 0
1245 None 0
1247 None 0
1249 None 0
1250 None 0
1251 None 0
1253 None 0
1254 None 0
1256 None 0
1258 None 0
1260 None 0
1261 None 0
1262 None 0
1263 None 0
1264 None 0
1265 None 0
1267 None 0
1268 None 0
1271 None 0
1272 None 0
1275 None 0
1277 None 0
1280 None 0
1283 None 0
1284 None 0
1285 None 0
1286 None 0
1289 None 0
1293 None 0
1294 None 0
1295 None 0
1300 None 0
1301 None 344
1302 None 0
1308 None 0
1309 None 0
1311 None 0
1315 None 0
1316 None 0
1318 None 0
1321 None 0
1322 None 0
1323 None 0
1324 None 0
1326 None 0
1327 None 0
1328 None 0
1329 None 0
1330 None 0
1332 None 0
1333 None 0
1334 None 0
1335 None 312
1338 None 0
1339 None 0
1340 None 0
1341 None 0
1342 None 0
1344 None 0
1345 None 0
1346 None 0
1347 None 0
1349 None 0
1350 None 0
1353 None 0
1355 None 0
1358 None 0
1361 None 0
1363 None 0
1364 None 0
1365 None 0
1366 None 0
1368 None 0
1371 None 0
1372 None 0
1373 None 0
1375 None 0
1377 None 0
1378 None 0
1380 None 0
1381 None 0
1382 None 0
1383 None 0
1384 None 0
1385 None 0
1386 None 0
1388 None 0
1390 None 0
1392 None 0
1393 None 0
1394 None 0
1397 None 0
1398 None 0
1399 None 0
1400 None 0
1401 None 0
1402 None 0
1403 None 0
1405 None 0
1407 None 0
1408 None 0
1409 None 0
1411 None 0
1412 None 0
1413 None 0
1415 None 0
1417 None 0
1419 None 0
1422 None 0
1424 None 0
1425 None 0
1428 None 0
1431 None 0
1432 None 0
1433 None 0
1435 None 0
1437 None 0
1439 None 0
1441 None 0
1444 None 0
1446 None 0
1449 None 0
1450 None 0
1451 None 0
1454 None 0
1455 None 0
1456 None 0
1458 None 0
1459 None 0
1460 None 0
ds2_all[ds2_all$table == "train",] %>% 
  dplyr::select("Id" | contains("MasVnr")) %>%
  filter_all(any_vars( is.na(.)))  %>%  
    kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Id MasVnrType MasVnrArea
235 NA NA
530 NA NA
651 NA NA
937 NA NA
974 NA NA
978 NA NA
1244 NA NA
1279 NA NA

Lareira

Similarmente, temos para se a casa possui ou não lareira:
- Fireplaces = 0 - FireplaceQu = NA
ambas as condições tem que ocorrer.

ds2_all[ds2_all$table == "train",] %>% 
  dplyr::select("Id" | contains("Fireplace")) %>%
  filter_all(any_vars( is.na(.)))  %>%  
    kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Id Fireplaces FireplaceQu
1 0 NA
6 0 NA
11 0 NA
13 0 NA
16 0 NA
18 0 NA
19 0 NA
20 0 NA
27 0 NA
30 0 NA
31 0 NA
32 0 NA
33 0 NA
37 0 NA
39 0 NA
40 0 NA
43 0 NA
44 0 NA
45 0 NA
48 0 NA
49 0 NA
50 0 NA
51 0 NA
53 0 NA
57 0 NA
58 0 NA
60 0 NA
61 0 NA
62 0 NA
64 0 NA
65 0 NA
68 0 NA
69 0 NA
72 0 NA
74 0 NA
75 0 NA
76 0 NA
77 0 NA
78 0 NA
79 0 NA
80 0 NA
82 0 NA
84 0 NA
88 0 NA
89 0 NA
90 0 NA
91 0 NA
92 0 NA
93 0 NA
95 0 NA
97 0 NA
98 0 NA
99 0 NA
100 0 NA
103 0 NA
104 0 NA
107 0 NA
108 0 NA
109 0 NA
111 0 NA
118 0 NA
122 0 NA
123 0 NA
124 0 NA
126 0 NA
130 0 NA
133 0 NA
134 0 NA
138 0 NA
140 0 NA
142 0 NA
143 0 NA
144 0 NA
145 0 NA
146 0 NA
147 0 NA
149 0 NA
150 0 NA
151 0 NA
155 0 NA
156 0 NA
157 0 NA
161 0 NA
164 0 NA
165 0 NA
166 0 NA
171 0 NA
180 0 NA
184 0 NA
185 0 NA
187 0 NA
188 0 NA
192 0 NA
193 0 NA
194 0 NA
195 0 NA
199 0 NA
201 0 NA
203 0 NA
205 0 NA
206 0 NA
210 0 NA
211 0 NA
212 0 NA
214 0 NA
215 0 NA
217 0 NA
218 0 NA
220 0 NA
221 0 NA
224 0 NA
226 0 NA
228 0 NA
231 0 NA
234 0 NA
236 0 NA
237 0 NA
238 0 NA
239 0 NA
241 0 NA
242 0 NA
243 0 NA
247 0 NA
249 0 NA
251 0 NA
253 0 NA
254 0 NA
255 0 NA
257 0 NA
260 0 NA
264 0 NA
265 0 NA
275 0 NA
276 0 NA
277 0 NA
278 0 NA
282 0 NA
285 0 NA
286 0 NA
288 0 NA
289 0 NA
290 0 NA
292 0 NA
296 0 NA
297 0 NA
304 0 NA
306 0 NA
308 0 NA
309 0 NA
321 0 NA
324 0 NA
326 0 NA
328 0 NA
329 0 NA
330 0 NA
331 0 NA
332 0 NA
338 0 NA
339 0 NA
340 0 NA
341 0 NA
342 0 NA
343 0 NA
345 0 NA
347 0 NA
349 0 NA
353 0 NA
354 0 NA
356 0 NA
357 0 NA
359 0 NA
362 0 NA
364 0 NA
366 0 NA
373 0 NA
374 0 NA
376 0 NA
377 0 NA
383 0 NA
384 0 NA
387 0 NA
389 0 NA
391 0 NA
393 0 NA
395 0 NA
396 0 NA
397 0 NA
399 0 NA
400 0 NA
403 0 NA
407 0 NA
408 0 NA
411 0 NA
412 0 NA
416 0 NA
419 0 NA
421 0 NA
423 0 NA
428 0 NA
429 0 NA
431 0 NA
432 0 NA
433 0 NA
435 0 NA
437 0 NA
438 0 NA
440 0 NA
442 0 NA
450 0 NA
451 0 NA
453 0 NA
454 0 NA
455 0 NA
457 0 NA
461 0 NA
462 0 NA
465 0 NA
470 0 NA
471 0 NA
473 0 NA
475 0 NA
476 0 NA
480 0 NA
484 0 NA
485 0 NA
487 0 NA
490 0 NA
493 0 NA
495 0 NA
496 0 NA
498 0 NA
499 0 NA
500 0 NA
501 0 NA
502 0 NA
503 0 NA
506 0 NA
508 0 NA
510 0 NA
513 0 NA
514 0 NA
515 0 NA
519 0 NA
521 0 NA
527 0 NA
529 0 NA
532 0 NA
534 0 NA
536 0 NA
537 0 NA
538 0 NA
544 0 NA
546 0 NA
548 0 NA
549 0 NA
551 0 NA
552 0 NA
554 0 NA
558 0 NA
566 0 NA
568 0 NA
571 0 NA
572 0 NA
573 0 NA
575 0 NA
576 0 NA
579 0 NA
580 0 NA
583 0 NA
585 0 NA
587 0 NA
588 0 NA
590 0 NA
591 0 NA
593 0 NA
594 0 NA
595 0 NA
597 0 NA
604 0 NA
607 0 NA
608 0 NA
610 0 NA
614 0 NA
615 0 NA
616 0 NA
618 0 NA
621 0 NA
623 0 NA
626 0 NA
630 0 NA
631 0 NA
634 0 NA
635 0 NA
636 0 NA
638 0 NA
639 0 NA
644 0 NA
646 0 NA
647 0 NA
650 0 NA
651 0 NA
654 0 NA
656 0 NA
657 0 NA
660 0 NA
664 0 NA
671 0 NA
672 0 NA
677 0 NA
678 0 NA
680 0 NA
682 0 NA
685 0 NA
687 0 NA
688 0 NA
694 0 NA
695 0 NA
697 0 NA
698 0 NA
700 0 NA
702 0 NA
705 0 NA
706 0 NA
710 0 NA
711 0 NA
712 0 NA
714 0 NA
715 0 NA
717 0 NA
722 0 NA
723 0 NA
724 0 NA
726 0 NA
728 0 NA
729 0 NA
730 0 NA
735 0 NA
737 0 NA
739 0 NA
740 0 NA
741 0 NA
742 0 NA
743 0 NA
750 0 NA
751 0 NA
752 0 NA
753 0 NA
755 0 NA
756 0 NA
757 0 NA
759 0 NA
761 0 NA
762 0 NA
763 0 NA
768 0 NA
771 0 NA
772 0 NA
774 0 NA
776 0 NA
777 0 NA
780 0 NA
782 0 NA
783 0 NA
787 0 NA
788 0 NA
789 0 NA
790 0 NA
794 0 NA
798 0 NA
801 0 NA
802 0 NA
805 0 NA
806 0 NA
807 0 NA
810 0 NA
813 0 NA
814 0 NA
815 0 NA
816 0 NA
819 0 NA
821 0 NA
822 0 NA
827 0 NA
829 0 NA
830 0 NA
832 0 NA
834 0 NA
835 0 NA
836 0 NA
837 0 NA
838 0 NA
839 0 NA
840 0 NA
841 0 NA
843 0 NA
844 0 NA
851 0 NA
856 0 NA
857 0 NA
862 0 NA
863 0 NA
864 0 NA
865 0 NA
866 0 NA
868 0 NA
871 0 NA
872 0 NA
873 0 NA
875 0 NA
877 0 NA
879 0 NA
880 0 NA
881 0 NA
884 0 NA
885 0 NA
887 0 NA
888 0 NA
893 0 NA
895 0 NA
897 0 NA
898 0 NA
901 0 NA
902 0 NA
905 0 NA
906 0 NA
909 0 NA
911 0 NA
912 0 NA
913 0 NA
914 0 NA
915 0 NA
916 0 NA
917 0 NA
918 0 NA
921 0 NA
922 0 NA
926 0 NA
931 0 NA
932 0 NA
934 0 NA
936 0 NA
937 0 NA
939 0 NA
941 0 NA
943 0 NA
944 0 NA
946 0 NA
951 0 NA
952 0 NA
953 0 NA
955 0 NA
956 0 NA
958 0 NA
959 0 NA
960 0 NA
961 0 NA
964 0 NA
968 0 NA
969 0 NA
970 0 NA
971 0 NA
972 0 NA
974 0 NA
975 0 NA
976 0 NA
977 0 NA
978 0 NA
979 0 NA
980 0 NA
981 0 NA
985 0 NA
986 0 NA
987 0 NA
990 0 NA
994 0 NA
996 0 NA
997 0 NA
1000 0 NA
1001 0 NA
1002 0 NA
1004 0 NA
1006 0 NA
1008 0 NA
1010 0 NA
1012 0 NA
1014 0 NA
1016 0 NA
1021 0 NA
1022 0 NA
1023 0 NA
1026 0 NA
1030 0 NA
1031 0 NA
1034 0 NA
1036 0 NA
1040 0 NA
1042 0 NA
1048 0 NA
1050 0 NA
1051 0 NA
1061 0 NA
1062 0 NA
1063 0 NA
1066 0 NA
1068 0 NA
1070 0 NA
1071 0 NA
1072 0 NA
1074 0 NA
1075 0 NA
1078 0 NA
1080 0 NA
1081 0 NA
1082 0 NA
1086 0 NA
1087 0 NA
1089 0 NA
1091 0 NA
1092 0 NA
1094 0 NA
1095 0 NA
1097 0 NA
1098 0 NA
1099 0 NA
1101 0 NA
1102 0 NA
1103 0 NA
1105 0 NA
1113 0 NA
1114 0 NA
1115 0 NA
1118 0 NA
1119 0 NA
1120 0 NA
1122 0 NA
1123 0 NA
1124 0 NA
1130 0 NA
1132 0 NA
1133 0 NA
1137 0 NA
1138 0 NA
1141 0 NA
1144 0 NA
1145 0 NA
1150 0 NA
1154 0 NA
1159 0 NA
1164 0 NA
1166 0 NA
1167 0 NA
1173 0 NA
1177 0 NA
1178 0 NA
1181 0 NA
1186 0 NA
1187 0 NA
1189 0 NA
1192 0 NA
1193 0 NA
1194 0 NA
1195 0 NA
1196 0 NA
1199 0 NA
1201 0 NA
1202 0 NA
1205 0 NA
1207 0 NA
1208 0 NA
1209 0 NA
1212 0 NA
1213 0 NA
1214 0 NA
1215 0 NA
1216 0 NA
1217 0 NA
1218 0 NA
1219 0 NA
1220 0 NA
1221 0 NA
1224 0 NA
1226 0 NA
1228 0 NA
1230 0 NA
1232 0 NA
1233 0 NA
1234 0 NA
1235 0 NA
1236 0 NA
1237 0 NA
1239 0 NA
1241 0 NA
1242 0 NA
1247 0 NA
1248 0 NA
1250 0 NA
1253 0 NA
1258 0 NA
1259 0 NA
1260 0 NA
1262 0 NA
1265 0 NA
1266 0 NA
1270 0 NA
1273 0 NA
1276 0 NA
1280 0 NA
1281 0 NA
1283 0 NA
1284 0 NA
1288 0 NA
1292 0 NA
1293 0 NA
1295 0 NA
1296 0 NA
1297 0 NA
1298 0 NA
1300 0 NA
1304 0 NA
1305 0 NA
1308 0 NA
1309 0 NA
1310 0 NA
1312 0 NA
1320 0 NA
1322 0 NA
1324 0 NA
1326 0 NA
1327 0 NA
1332 0 NA
1334 0 NA
1335 0 NA
1337 0 NA
1338 0 NA
1339 0 NA
1340 0 NA
1341 0 NA
1342 0 NA
1346 0 NA
1350 0 NA
1351 0 NA
1353 0 NA
1357 0 NA
1359 0 NA
1361 0 NA
1365 0 NA
1366 0 NA
1367 0 NA
1369 0 NA
1371 0 NA
1377 0 NA
1378 0 NA
1379 0 NA
1380 0 NA
1381 0 NA
1383 0 NA
1384 0 NA
1385 0 NA
1386 0 NA
1391 0 NA
1392 0 NA
1398 0 NA
1399 0 NA
1404 0 NA
1405 0 NA
1407 0 NA
1408 0 NA
1409 0 NA
1411 0 NA
1412 0 NA
1413 0 NA
1417 0 NA
1419 0 NA
1423 0 NA
1426 0 NA
1432 0 NA
1433 0 NA
1437 0 NA
1439 0 NA
1445 0 NA
1446 0 NA
1447 0 NA
1449 0 NA
1450 0 NA
1451 0 NA
1453 0 NA
1454 0 NA
1455 0 NA
1459 0 NA
1460 0 NA

Garagem

Similarmente, temos para se a casa possui ou não garagem todas as condçoes a seguir devem ocorrer:
- GarageType = NA - GarageYrBlt = NA - GarageFinish = NA - GarageCars = 0 - GarageArea = 0 - GarageQual = NA - GarageCond = NA

ds2_all[ds2_all$table == "train",] %>% 
  dplyr::select("Id" | contains("Garage")) %>%
  filter_all(any_vars( is.na(.)))  %>%  
    kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Id GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond
40 NA NA NA 0 0 NA NA
49 NA NA NA 0 0 NA NA
79 NA NA NA 0 0 NA NA
89 NA NA NA 0 0 NA NA
90 NA NA NA 0 0 NA NA
100 NA NA NA 0 0 NA NA
109 NA NA NA 0 0 NA NA
126 NA NA NA 0 0 NA NA
128 NA NA NA 0 0 NA NA
141 NA NA NA 0 0 NA NA
149 NA NA NA 0 0 NA NA
156 NA NA NA 0 0 NA NA
164 NA NA NA 0 0 NA NA
166 NA NA NA 0 0 NA NA
199 NA NA NA 0 0 NA NA
211 NA NA NA 0 0 NA NA
242 NA NA NA 0 0 NA NA
251 NA NA NA 0 0 NA NA
288 NA NA NA 0 0 NA NA
292 NA NA NA 0 0 NA NA
308 NA NA NA 0 0 NA NA
376 NA NA NA 0 0 NA NA
387 NA NA NA 0 0 NA NA
394 NA NA NA 0 0 NA NA
432 NA NA NA 0 0 NA NA
435 NA NA NA 0 0 NA NA
442 NA NA NA 0 0 NA NA
465 NA NA NA 0 0 NA NA
496 NA NA NA 0 0 NA NA
521 NA NA NA 0 0 NA NA
529 NA NA NA 0 0 NA NA
534 NA NA NA 0 0 NA NA
536 NA NA NA 0 0 NA NA
563 NA NA NA 0 0 NA NA
583 NA NA NA 0 0 NA NA
614 NA NA NA 0 0 NA NA
615 NA NA NA 0 0 NA NA
621 NA NA NA 0 0 NA NA
636 NA NA NA 0 0 NA NA
637 NA NA NA 0 0 NA NA
639 NA NA NA 0 0 NA NA
650 NA NA NA 0 0 NA NA
706 NA NA NA 0 0 NA NA
711 NA NA NA 0 0 NA NA
739 NA NA NA 0 0 NA NA
751 NA NA NA 0 0 NA NA
785 NA NA NA 0 0 NA NA
827 NA NA NA 0 0 NA NA
844 NA NA NA 0 0 NA NA
922 NA NA NA 0 0 NA NA
943 NA NA NA 0 0 NA NA
955 NA NA NA 0 0 NA NA
961 NA NA NA 0 0 NA NA
969 NA NA NA 0 0 NA NA
971 NA NA NA 0 0 NA NA
977 NA NA NA 0 0 NA NA
1010 NA NA NA 0 0 NA NA
1012 NA NA NA 0 0 NA NA
1031 NA NA NA 0 0 NA NA
1039 NA NA NA 0 0 NA NA
1097 NA NA NA 0 0 NA NA
1124 NA NA NA 0 0 NA NA
1132 NA NA NA 0 0 NA NA
1138 NA NA NA 0 0 NA NA
1144 NA NA NA 0 0 NA NA
1174 NA NA NA 0 0 NA NA
1180 NA NA NA 0 0 NA NA
1219 NA NA NA 0 0 NA NA
1220 NA NA NA 0 0 NA NA
1235 NA NA NA 0 0 NA NA
1258 NA NA NA 0 0 NA NA
1284 NA NA NA 0 0 NA NA
1324 NA NA NA 0 0 NA NA
1326 NA NA NA 0 0 NA NA
1327 NA NA NA 0 0 NA NA
1338 NA NA NA 0 0 NA NA
1350 NA NA NA 0 0 NA NA
1408 NA NA NA 0 0 NA NA
1450 NA NA NA 0 0 NA NA
1451 NA NA NA 0 0 NA NA
1454 NA NA NA 0 0 NA NA

Piscina

Similarmente, temos para se a casa possui ou não piscina:
- PoolArea = 0 - PoolQC = NA
ambas as condições tem que ocorrer.

ds2_all[ds2_all$table == "train",] %>% 
  dplyr::select("Id" | contains("Pool")) %>%
  filter_all(any_vars( is.na(.)))  %>%  
    kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Id PoolArea PoolQC
1 0 NA
2 0 NA
3 0 NA
4 0 NA
5 0 NA
6 0 NA
7 0 NA
8 0 NA
9 0 NA
10 0 NA
11 0 NA
12 0 NA
13 0 NA
14 0 NA
15 0 NA
16 0 NA
17 0 NA
18 0 NA
19 0 NA
20 0 NA
21 0 NA
22 0 NA
23 0 NA
24 0 NA
25 0 NA
26 0 NA
27 0 NA
28 0 NA
29 0 NA
30 0 NA
31 0 NA
32 0 NA
33 0 NA
34 0 NA
35 0 NA
36 0 NA
37 0 NA
38 0 NA
39 0 NA
40 0 NA
41 0 NA
42 0 NA
43 0 NA
44 0 NA
45 0 NA
46 0 NA
47 0 NA
48 0 NA
49 0 NA
50 0 NA
51 0 NA
52 0 NA
53 0 NA
54 0 NA
55 0 NA
56 0 NA
57 0 NA
58 0 NA
59 0 NA
60 0 NA
61 0 NA
62 0 NA
63 0 NA
64 0 NA
65 0 NA
66 0 NA
67 0 NA
68 0 NA
69 0 NA
70 0 NA
71 0 NA
72 0 NA
73 0 NA
74 0 NA
75 0 NA
76 0 NA
77 0 NA
78 0 NA
79 0 NA
80 0 NA
81 0 NA
82 0 NA
83 0 NA
84 0 NA
85 0 NA
86 0 NA
87 0 NA
88 0 NA
89 0 NA
90 0 NA
91 0 NA
92 0 NA
93 0 NA
94 0 NA
95 0 NA
96 0 NA
97 0 NA
98 0 NA
99 0 NA
100 0 NA
101 0 NA
102 0 NA
103 0 NA
104 0 NA
105 0 NA
106 0 NA
107 0 NA
108 0 NA
109 0 NA
110 0 NA
111 0 NA
112 0 NA
113 0 NA
114 0 NA
115 0 NA
116 0 NA
117 0 NA
118 0 NA
119 0 NA
120 0 NA
121 0 NA
122 0 NA
123 0 NA
124 0 NA
125 0 NA
126 0 NA
127 0 NA
128 0 NA
129 0 NA
130 0 NA
131 0 NA
132 0 NA
133 0 NA
134 0 NA
135 0 NA
136 0 NA
137 0 NA
138 0 NA
139 0 NA
140 0 NA
141 0 NA
142 0 NA
143 0 NA
144 0 NA
145 0 NA
146 0 NA
147 0 NA
148 0 NA
149 0 NA
150 0 NA
151 0 NA
152 0 NA
153 0 NA
154 0 NA
155 0 NA
156 0 NA
157 0 NA
158 0 NA
159 0 NA
160 0 NA
161 0 NA
162 0 NA
163 0 NA
164 0 NA
165 0 NA
166 0 NA
167 0 NA
168 0 NA
169 0 NA
170 0 NA
171 0 NA
172 0 NA
173 0 NA
174 0 NA
175 0 NA
176 0 NA
177 0 NA
178 0 NA
179 0 NA
180 0 NA
181 0 NA
182 0 NA
183 0 NA
184 0 NA
185 0 NA
186 0 NA
187 0 NA
188 0 NA
189 0 NA
190 0 NA
191 0 NA
192 0 NA
193 0 NA
194 0 NA
195 0 NA
196 0 NA
197 0 NA
199 0 NA
200 0 NA
201 0 NA
202 0 NA
203 0 NA
204 0 NA
205 0 NA
206 0 NA
207 0 NA
208 0 NA
209 0 NA
210 0 NA
211 0 NA
212 0 NA
213 0 NA
214 0 NA
215 0 NA
216 0 NA
217 0 NA
218 0 NA
219 0 NA
220 0 NA
221 0 NA
222 0 NA
223 0 NA
224 0 NA
225 0 NA
226 0 NA
227 0 NA
228 0 NA
229 0 NA
230 0 NA
231 0 NA
232 0 NA
233 0 NA
234 0 NA
235 0 NA
236 0 NA
237 0 NA
238 0 NA
239 0 NA
240 0 NA
241 0 NA
242 0 NA
243 0 NA
244 0 NA
245 0 NA
246 0 NA
247 0 NA
248 0 NA
249 0 NA
250 0 NA
251 0 NA
252 0 NA
253 0 NA
254 0 NA
255 0 NA
256 0 NA
257 0 NA
258 0 NA
259 0 NA
260 0 NA
261 0 NA
262 0 NA
263 0 NA
264 0 NA
265 0 NA
266 0 NA
267 0 NA
268 0 NA
269 0 NA
270 0 NA
271 0 NA
272 0 NA
273 0 NA
274 0 NA
275 0 NA
276 0 NA
277 0 NA
278 0 NA
279 0 NA
280 0 NA
281 0 NA
282 0 NA
283 0 NA
284 0 NA
285 0 NA
286 0 NA
287 0 NA
288 0 NA
289 0 NA
290 0 NA
291 0 NA
292 0 NA
293 0 NA
294 0 NA
295 0 NA
296 0 NA
297 0 NA
298 0 NA
299 0 NA
300 0 NA
301 0 NA
302 0 NA
303 0 NA
304 0 NA
305 0 NA
306 0 NA
307 0 NA
308 0 NA
309 0 NA
310 0 NA
311 0 NA
312 0 NA
313 0 NA
314 0 NA
315 0 NA
316 0 NA
317 0 NA
318 0 NA
319 0 NA
320 0 NA
321 0 NA
322 0 NA
323 0 NA
324 0 NA
325 0 NA
326 0 NA
327 0 NA
328 0 NA
329 0 NA
330 0 NA
331 0 NA
332 0 NA
333 0 NA
334 0 NA
335 0 NA
336 0 NA
337 0 NA
338 0 NA
339 0 NA
340 0 NA
341 0 NA
342 0 NA
343 0 NA
344 0 NA
345 0 NA
346 0 NA
347 0 NA
348 0 NA
349 0 NA
350 0 NA
351 0 NA
352 0 NA
353 0 NA
354 0 NA
355 0 NA
356 0 NA
357 0 NA
358 0 NA
359 0 NA
360 0 NA
361 0 NA
362 0 NA
363 0 NA
364 0 NA
365 0 NA
366 0 NA
367 0 NA
368 0 NA
369 0 NA
370 0 NA
371 0 NA
372 0 NA
373 0 NA
374 0 NA
375 0 NA
376 0 NA
377 0 NA
378 0 NA
379 0 NA
380 0 NA
381 0 NA
382 0 NA
383 0 NA
384 0 NA
385 0 NA
386 0 NA
387 0 NA
388 0 NA
389 0 NA
390 0 NA
391 0 NA
392 0 NA
393 0 NA
394 0 NA
395 0 NA
396 0 NA
397 0 NA
398 0 NA
399 0 NA
400 0 NA
401 0 NA
402 0 NA
403 0 NA
404 0 NA
405 0 NA
406 0 NA
407 0 NA
408 0 NA
409 0 NA
410 0 NA
411 0 NA
412 0 NA
413 0 NA
414 0 NA
415 0 NA
416 0 NA
417 0 NA
418 0 NA
419 0 NA
420 0 NA
421 0 NA
422 0 NA
423 0 NA
424 0 NA
425 0 NA
426 0 NA
427 0 NA
428 0 NA
429 0 NA
430 0 NA
431 0 NA
432 0 NA
433 0 NA
434 0 NA
435 0 NA
436 0 NA
437 0 NA
438 0 NA
439 0 NA
440 0 NA
441 0 NA
442 0 NA
443 0 NA
444 0 NA
445 0 NA
446 0 NA
447 0 NA
448 0 NA
449 0 NA
450 0 NA
451 0 NA
452 0 NA
453 0 NA
454 0 NA
455 0 NA
456 0 NA
457 0 NA
458 0 NA
459 0 NA
460 0 NA
461 0 NA
462 0 NA
463 0 NA
464 0 NA
465 0 NA
466 0 NA
467 0 NA
468 0 NA
469 0 NA
470 0 NA
471 0 NA
472 0 NA
473 0 NA
474 0 NA
475 0 NA
476 0 NA
477 0 NA
478 0 NA
479 0 NA
480 0 NA
481 0 NA
482 0 NA
483 0 NA
484 0 NA
485 0 NA
486 0 NA
487 0 NA
488 0 NA
489 0 NA
490 0 NA
491 0 NA
492 0 NA
493 0 NA
494 0 NA
495 0 NA
496 0 NA
497 0 NA
498 0 NA
499 0 NA
500 0 NA
501 0 NA
502 0 NA
503 0 NA
504 0 NA
505 0 NA
506 0 NA
507 0 NA
508 0 NA
509 0 NA
510 0 NA
511 0 NA
512 0 NA
513 0 NA
514 0 NA
515 0 NA
516 0 NA
517 0 NA
518 0 NA
519 0 NA
520 0 NA
521 0 NA
522 0 NA
523 0 NA
524 0 NA
525 0 NA
526 0 NA
527 0 NA
528 0 NA
529 0 NA
530 0 NA
531 0 NA
532 0 NA
533 0 NA
534 0 NA
535 0 NA
536 0 NA
537 0 NA
538 0 NA
539 0 NA
540 0 NA
541 0 NA
542 0 NA
543 0 NA
544 0 NA
545 0 NA
546 0 NA
547 0 NA
548 0 NA
549 0 NA
550 0 NA
551 0 NA
552 0 NA
553 0 NA
554 0 NA
555 0 NA
556 0 NA
557 0 NA
558 0 NA
559 0 NA
560 0 NA
561 0 NA
562 0 NA
563 0 NA
564 0 NA
565 0 NA
566 0 NA
567 0 NA
568 0 NA
569 0 NA
570 0 NA
571 0 NA
572 0 NA
573 0 NA
574 0 NA
575 0 NA
576 0 NA
577 0 NA
578 0 NA
579 0 NA
580 0 NA
581 0 NA
582 0 NA
583 0 NA
584 0 NA
585 0 NA
586 0 NA
587 0 NA
588 0 NA
589 0 NA
590 0 NA
591 0 NA
592 0 NA
593 0 NA
594 0 NA
595 0 NA
596 0 NA
597 0 NA
598 0 NA
599 0 NA
600 0 NA
601 0 NA
602 0 NA
603 0 NA
604 0 NA
605 0 NA
606 0 NA
607 0 NA
608 0 NA
609 0 NA
610 0 NA
611 0 NA
612 0 NA
613 0 NA
614 0 NA
615 0 NA
616 0 NA
617 0 NA
618 0 NA
619 0 NA
620 0 NA
621 0 NA
622 0 NA
623 0 NA
624 0 NA
625 0 NA
626 0 NA
627 0 NA
628 0 NA
629 0 NA
630 0 NA
631 0 NA
632 0 NA
633 0 NA
634 0 NA
635 0 NA
636 0 NA
637 0 NA
638 0 NA
639 0 NA
640 0 NA
641 0 NA
642 0 NA
643 0 NA
644 0 NA
645 0 NA
646 0 NA
647 0 NA
648 0 NA
649 0 NA
650 0 NA
651 0 NA
652 0 NA
653 0 NA
654 0 NA
655 0 NA
656 0 NA
657 0 NA
658 0 NA
659 0 NA
660 0 NA
661 0 NA
662 0 NA
663 0 NA
664 0 NA
665 0 NA
666 0 NA
667 0 NA
668 0 NA
669 0 NA
670 0 NA
671 0 NA
672 0 NA
673 0 NA
674 0 NA
675 0 NA
676 0 NA
677 0 NA
678 0 NA
679 0 NA
680 0 NA
681 0 NA
682 0 NA
683 0 NA
684 0 NA
685 0 NA
686 0 NA
687 0 NA
688 0 NA
689 0 NA
690 0 NA
691 0 NA
692 0 NA
693 0 NA
694 0 NA
695 0 NA
696 0 NA
697 0 NA
698 0 NA
699 0 NA
700 0 NA
701 0 NA
702 0 NA
703 0 NA
704 0 NA
705 0 NA
706 0 NA
707 0 NA
708 0 NA
709 0 NA
710 0 NA
711 0 NA
712 0 NA
713 0 NA
714 0 NA
715 0 NA
716 0 NA
717 0 NA
718 0 NA
719 0 NA
720 0 NA
721 0 NA
722 0 NA
723 0 NA
724 0 NA
725 0 NA
726 0 NA
727 0 NA
728 0 NA
729 0 NA
730 0 NA
731 0 NA
732 0 NA
733 0 NA
734 0 NA
735 0 NA
736 0 NA
737 0 NA
738 0 NA
739 0 NA
740 0 NA
741 0 NA
742 0 NA
743 0 NA
744 0 NA
745 0 NA
746 0 NA
747 0 NA
748 0 NA
749 0 NA
750 0 NA
751 0 NA
752 0 NA
753 0 NA
754 0 NA
755 0 NA
756 0 NA
757 0 NA
758 0 NA
759 0 NA
760 0 NA
761 0 NA
762 0 NA
763 0 NA
764 0 NA
765 0 NA
766 0 NA
767 0 NA
768 0 NA
769 0 NA
770 0 NA
771 0 NA
772 0 NA
773 0 NA
774 0 NA
775 0 NA
776 0 NA
777 0 NA
778 0 NA
779 0 NA
780 0 NA
781 0 NA
782 0 NA
783 0 NA
784 0 NA
785 0 NA
786 0 NA
787 0 NA
788 0 NA
789 0 NA
790 0 NA
791 0 NA
792 0 NA
793 0 NA
794 0 NA
795 0 NA
796 0 NA
797 0 NA
798 0 NA
799 0 NA
800 0 NA
801 0 NA
802 0 NA
803 0 NA
804 0 NA
805 0 NA
806 0 NA
807 0 NA
808 0 NA
809 0 NA
810 0 NA
812 0 NA
813 0 NA
814 0 NA
815 0 NA
816 0 NA
817 0 NA
818 0 NA
819 0 NA
820 0 NA
821 0 NA
822 0 NA
823 0 NA
824 0 NA
825 0 NA
826 0 NA
827 0 NA
828 0 NA
829 0 NA
830 0 NA
831 0 NA
832 0 NA
833 0 NA
834 0 NA
835 0 NA
836 0 NA
837 0 NA
838 0 NA
839 0 NA
840 0 NA
841 0 NA
842 0 NA
843 0 NA
844 0 NA
845 0 NA
846 0 NA
847 0 NA
848 0 NA
849 0 NA
850 0 NA
851 0 NA
852 0 NA
853 0 NA
854 0 NA
855 0 NA
856 0 NA
857 0 NA
858 0 NA
859 0 NA
860 0 NA
861 0 NA
862 0 NA
863 0 NA
864 0 NA
865 0 NA
866 0 NA
867 0 NA
868 0 NA
869 0 NA
870 0 NA
871 0 NA
872 0 NA
873 0 NA
874 0 NA
875 0 NA
876 0 NA
877 0 NA
878 0 NA
879 0 NA
880 0 NA
881 0 NA
882 0 NA
883 0 NA
884 0 NA
885 0 NA
886 0 NA
887 0 NA
888 0 NA
889 0 NA
890 0 NA
891 0 NA
892 0 NA
893 0 NA
894 0 NA
895 0 NA
896 0 NA
897 0 NA
898 0 NA
899 0 NA
900 0 NA
901 0 NA
902 0 NA
903 0 NA
904 0 NA
905 0 NA
906 0 NA
907 0 NA
908 0 NA
909 0 NA
910 0 NA
911 0 NA
912 0 NA
913 0 NA
914 0 NA
915 0 NA
916 0 NA
917 0 NA
918 0 NA
919 0 NA
920 0 NA
921 0 NA
922 0 NA
923 0 NA
924 0 NA
925 0 NA
926 0 NA
927 0 NA
928 0 NA
929 0 NA
930 0 NA
931 0 NA
932 0 NA
933 0 NA
934 0 NA
935 0 NA
936 0 NA
937 0 NA
938 0 NA
939 0 NA
940 0 NA
941 0 NA
942 0 NA
943 0 NA
944 0 NA
945 0 NA
946 0 NA
947 0 NA
948 0 NA
949 0 NA
950 0 NA
951 0 NA
952 0 NA
953 0 NA
954 0 NA
955 0 NA
956 0 NA
957 0 NA
958 0 NA
959 0 NA
960 0 NA
961 0 NA
962 0 NA
963 0 NA
964 0 NA
965 0 NA
966 0 NA
967 0 NA
968 0 NA
969 0 NA
970 0 NA
971 0 NA
972 0 NA
973 0 NA
974 0 NA
975 0 NA
976 0 NA
977 0 NA
978 0 NA
979 0 NA
980 0 NA
981 0 NA
982 0 NA
983 0 NA
984 0 NA
985 0 NA
986 0 NA
987 0 NA
988 0 NA
989 0 NA
990 0 NA
991 0 NA
992 0 NA
993 0 NA
994 0 NA
995 0 NA
996 0 NA
997 0 NA
998 0 NA
999 0 NA
1000 0 NA
1001 0 NA
1002 0 NA
1003 0 NA
1004 0 NA
1005 0 NA
1006 0 NA
1007 0 NA
1008 0 NA
1009 0 NA
1010 0 NA
1011 0 NA
1012 0 NA
1013 0 NA
1014 0 NA
1015 0 NA
1016 0 NA
1017 0 NA
1018 0 NA
1019 0 NA
1020 0 NA
1021 0 NA
1022 0 NA
1023 0 NA
1024 0 NA
1025 0 NA
1026 0 NA
1027 0 NA
1028 0 NA
1029 0 NA
1030 0 NA
1031 0 NA
1032 0 NA
1033 0 NA
1034 0 NA
1035 0 NA
1036 0 NA
1037 0 NA
1038 0 NA
1039 0 NA
1040 0 NA
1041 0 NA
1042 0 NA
1043 0 NA
1044 0 NA
1045 0 NA
1046 0 NA
1047 0 NA
1048 0 NA
1049 0 NA
1050 0 NA
1051 0 NA
1052 0 NA
1053 0 NA
1054 0 NA
1055 0 NA
1056 0 NA
1057 0 NA
1058 0 NA
1059 0 NA
1060 0 NA
1061 0 NA
1062 0 NA
1063 0 NA
1064 0 NA
1065 0 NA
1066 0 NA
1067 0 NA
1068 0 NA
1069 0 NA
1070 0 NA
1071 0 NA
1072 0 NA
1073 0 NA
1074 0 NA
1075 0 NA
1076 0 NA
1077 0 NA
1078 0 NA
1079 0 NA
1080 0 NA
1081 0 NA
1082 0 NA
1083 0 NA
1084 0 NA
1085 0 NA
1086 0 NA
1087 0 NA
1088 0 NA
1089 0 NA
1090 0 NA
1091 0 NA
1092 0 NA
1093 0 NA
1094 0 NA
1095 0 NA
1096 0 NA
1097 0 NA
1098 0 NA
1099 0 NA
1100 0 NA
1101 0 NA
1102 0 NA
1103 0 NA
1104 0 NA
1105 0 NA
1106 0 NA
1107 0 NA
1108 0 NA
1109 0 NA
1110 0 NA
1111 0 NA
1112 0 NA
1113 0 NA
1114 0 NA
1115 0 NA
1116 0 NA
1117 0 NA
1118 0 NA
1119 0 NA
1120 0 NA
1121 0 NA
1122 0 NA
1123 0 NA
1124 0 NA
1125 0 NA
1126 0 NA
1127 0 NA
1128 0 NA
1129 0 NA
1130 0 NA
1131 0 NA
1132 0 NA
1133 0 NA
1134 0 NA
1135 0 NA
1136 0 NA
1137 0 NA
1138 0 NA
1139 0 NA
1140 0 NA
1141 0 NA
1142 0 NA
1143 0 NA
1144 0 NA
1145 0 NA
1146 0 NA
1147 0 NA
1148 0 NA
1149 0 NA
1150 0 NA
1151 0 NA
1152 0 NA
1153 0 NA
1154 0 NA
1155 0 NA
1156 0 NA
1157 0 NA
1158 0 NA
1159 0 NA
1160 0 NA
1161 0 NA
1162 0 NA
1163 0 NA
1164 0 NA
1165 0 NA
1166 0 NA
1167 0 NA
1168 0 NA
1169 0 NA
1170 0 NA
1172 0 NA
1173 0 NA
1174 0 NA
1175 0 NA
1176 0 NA
1177 0 NA
1178 0 NA
1179 0 NA
1180 0 NA
1181 0 NA
1182 0 NA
1184 0 NA
1185 0 NA
1186 0 NA
1187 0 NA
1188 0 NA
1189 0 NA
1190 0 NA
1191 0 NA
1192 0 NA
1193 0 NA
1194 0 NA
1195 0 NA
1196 0 NA
1197 0 NA
1198 0 NA
1199 0 NA
1200 0 NA
1201 0 NA
1202 0 NA
1203 0 NA
1204 0 NA
1205 0 NA
1206 0 NA
1207 0 NA
1208 0 NA
1209 0 NA
1210 0 NA
1211 0 NA
1212 0 NA
1213 0 NA
1214 0 NA
1215 0 NA
1216 0 NA
1217 0 NA
1218 0 NA
1219 0 NA
1220 0 NA
1221 0 NA
1222 0 NA
1223 0 NA
1224 0 NA
1225 0 NA
1226 0 NA
1227 0 NA
1228 0 NA
1229 0 NA
1230 0 NA
1231 0 NA
1232 0 NA
1233 0 NA
1234 0 NA
1235 0 NA
1236 0 NA
1237 0 NA
1238 0 NA
1239 0 NA
1240 0 NA
1241 0 NA
1242 0 NA
1243 0 NA
1244 0 NA
1245 0 NA
1246 0 NA
1247 0 NA
1248 0 NA
1249 0 NA
1250 0 NA
1251 0 NA
1252 0 NA
1253 0 NA
1254 0 NA
1255 0 NA
1256 0 NA
1257 0 NA
1258 0 NA
1259 0 NA
1260 0 NA
1261 0 NA
1262 0 NA
1263 0 NA
1264 0 NA
1265 0 NA
1266 0 NA
1267 0 NA
1268 0 NA
1269 0 NA
1270 0 NA
1271 0 NA
1272 0 NA
1273 0 NA
1274 0 NA
1275 0 NA
1276 0 NA
1277 0 NA
1278 0 NA
1279 0 NA
1280 0 NA
1281 0 NA
1282 0 NA
1283 0 NA
1284 0 NA
1285 0 NA
1286 0 NA
1287 0 NA
1288 0 NA
1289 0 NA
1290 0 NA
1291 0 NA
1292 0 NA
1293 0 NA
1294 0 NA
1295 0 NA
1296 0 NA
1297 0 NA
1298 0 NA
1300 0 NA
1301 0 NA
1302 0 NA
1303 0 NA
1304 0 NA
1305 0 NA
1306 0 NA
1307 0 NA
1308 0 NA
1309 0 NA
1310 0 NA
1311 0 NA
1312 0 NA
1313 0 NA
1314 0 NA
1315 0 NA
1316 0 NA
1317 0 NA
1318 0 NA
1319 0 NA
1320 0 NA
1321 0 NA
1322 0 NA
1323 0 NA
1324 0 NA
1325 0 NA
1326 0 NA
1327 0 NA
1328 0 NA
1329 0 NA
1330 0 NA
1331 0 NA
1332 0 NA
1333 0 NA
1334 0 NA
1335 0 NA
1336 0 NA
1337 0 NA
1338 0 NA
1339 0 NA
1340 0 NA
1341 0 NA
1342 0 NA
1343 0 NA
1344 0 NA
1345 0 NA
1346 0 NA
1347 0 NA
1348 0 NA
1349 0 NA
1350 0 NA
1351 0 NA
1352 0 NA
1353 0 NA
1354 0 NA
1355 0 NA
1356 0 NA
1357 0 NA
1358 0 NA
1359 0 NA
1360 0 NA
1361 0 NA
1362 0 NA
1363 0 NA
1364 0 NA
1365 0 NA
1366 0 NA
1367 0 NA
1368 0 NA
1369 0 NA
1370 0 NA
1371 0 NA
1372 0 NA
1373 0 NA
1374 0 NA
1375 0 NA
1376 0 NA
1377 0 NA
1378 0 NA
1379 0 NA
1380 0 NA
1381 0 NA
1382 0 NA
1383 0 NA
1384 0 NA
1385 0 NA
1386 0 NA
1388 0 NA
1389 0 NA
1390 0 NA
1391 0 NA
1392 0 NA
1393 0 NA
1394 0 NA
1395 0 NA
1396 0 NA
1397 0 NA
1398 0 NA
1399 0 NA
1400 0 NA
1401 0 NA
1402 0 NA
1403 0 NA
1404 0 NA
1405 0 NA
1406 0 NA
1407 0 NA
1408 0 NA
1409 0 NA
1410 0 NA
1411 0 NA
1412 0 NA
1413 0 NA
1414 0 NA
1415 0 NA
1416 0 NA
1417 0 NA
1418 0 NA
1419 0 NA
1420 0 NA
1421 0 NA
1422 0 NA
1423 0 NA
1425 0 NA
1426 0 NA
1427 0 NA
1428 0 NA
1429 0 NA
1430 0 NA
1431 0 NA
1432 0 NA
1433 0 NA
1434 0 NA
1435 0 NA
1436 0 NA
1437 0 NA
1438 0 NA
1439 0 NA
1440 0 NA
1441 0 NA
1442 0 NA
1443 0 NA
1444 0 NA
1445 0 NA
1446 0 NA
1447 0 NA
1448 0 NA
1449 0 NA
1450 0 NA
1451 0 NA
1452 0 NA
1453 0 NA
1454 0 NA
1455 0 NA
1456 0 NA
1457 0 NA
1458 0 NA
1459 0 NA
1460 0 NA

Ajuste e criação de Dummies

# mudando NA para 0 pensando no futuro, assumindo que o R vai jogar um dos fatores (geralmente em ordem alfabética) como base quando calculamos a regressão. E, no caso, NA para essa variável não é valor faltante, é uma resposta válidade.

  # Alley
ds2_all$Alley_ <- ifelse(is.na(ds2_all$Alley), 0, ds2_all$Alley)
#ds2_all[c(22:32),c("Alley","Alley_")]


  # BsmtQual / BsmtCond / BsmtExposure / BsmtFinType1 / BsmtFinType2
# existência de porão
ds2_all$BsmtDummy_ <- ifelse(is.na(ds2_all$BsmtQual) &
                               is.na(ds2_all$BsmtCond) &
                               is.na(ds2_all$BsmtExposure) &
                               is.na(ds2_all$BsmtFinType1) &
                               ds2_all$BsmtFinSF1 == 0 &
                               is.na(ds2_all$BsmtFinType2) &
                               ds2_all$BsmtFinSF2 == 0 &
                               ds2_all$BsmtUnfSF == 0 &
                               ds2_all$TotalBsmtSF == 0, 0, 1) #binary

ds2_all$BsmtQual_ <- ifelse(ds2_all$BsmtDummy_ == 0 & is.na(ds2_all$BsmtQual), 0, ds2_all$BsmtQual)
#ds2_all[c(15:25),c("BsmtQual","BsmtQual_")]
ds2_all$BsmtCond_ <- ifelse(ds2_all$BsmtDummy_ == 0 & is.na(ds2_all$BsmtCond), 0, ds2_all$BsmtCond)
ds2_all$BsmtExposure_ <- ifelse(ds2_all$BsmtDummy_ == 0 & is.na(ds2_all$BsmtExposure), 0, ds2_all$BsmtExposure)
ds2_all$BsmtFinType1_ <- ifelse(ds2_all$BsmtDummy_ == 0 & is.na(ds2_all$BsmtFinType1), 0, ds2_all$BsmtFinType1)
ds2_all$BsmtFinType2_ <- ifelse(ds2_all$BsmtDummy_ == 0 & is.na(ds2_all$BsmtFinType2), 0, ds2_all$BsmtFinType2)


  # MasVnrTyp
# existência de alvenaria
ds2_all$MasVnrDummy_ <- ifelse(ds2_all$MasVnrType == "None" & 
                                    ds2_all$MasVnrArea == 0, 0, 1) #binary
ds2_all$MasVnrType_ <- ifelse(ds2_all$MasVnrType == "None" , 0, ds2_all$MasVnrType)

  # FireplaceQu
# existência de lareira
ds2_all$FirePlaceDummy_ <- ifelse(ds2_all$Fireplaces == 0 & 
                                    is.na(ds2_all$FireplaceQu), 0, 1) #binary
ds2_all$FireplaceQu_ <- ifelse(ds2_all$FirePlaceDummy_ == 0 & 
                                 is.na(ds2_all$FireplaceQu), 0, ds2_all$FireplaceQu)


  # GarageType / GarageFinish / GarageQual / GarageCond 
# existência de garagem
ds2_all$GarageDummy_ <- ifelse(is.na(ds2_all$GarageType) &
                                 is.na(ds2_all$GarageYrBlt) &
                                 is.na(ds2_all$GarageFinish) &
                                 ds2_all$GarageCars == 0 &
                                 ds2_all$GarageArea == 0 &
                                 is.na(ds2_all$GarageQual) &
                                 is.na(ds2_all$GarageCond) , 0, 1) #binary
ds2_all$GarageType_ <- ifelse(ds2_all$GarageDummy_ == 0 & is.na(ds2_all$GarageType), 0, ds2_all$GarageType)
ds2_all$GarageFinish_ <- ifelse(ds2_all$GarageDummy_ == 0 & is.na(ds2_all$GarageFinish), 0, ds2_all$GarageFinish)
ds2_all$GarageQual_ <- ifelse(ds2_all$GarageDummy_ == 0 & is.na(ds2_all$GarageQual), 0, ds2_all$GarageQual)
ds2_all$GarageCond_ <- ifelse(ds2_all$GarageDummy_ == 0 & is.na(ds2_all$GarageCond), 0, ds2_all$GarageCond)


  # PoolQC
# existência de piscina
ds2_all$PoolDummy_ <- ifelse(ds2_all$PoolArea == 0 & is.na(ds2_all$PoolQC), 0, 1) #binary
ds2_all$PoolQC_ <- ifelse(ds2_all$PoolDummy_ == 0 & is.na(ds2_all$PoolQC), 0, ds2_all$PoolQC)


  # Fence
# existência de cerca
ds2_all$FenceDummy_ <- ifelse(is.na(ds2_all$Fence), 0, 1) #binary
ds2_all$Fence_ <- ifelse(is.na(ds2_all$Fence), 0, ds2_all$Fence)


  # MiscFeature
# existência de coisas variadas
ds2_all$MiscFeatureDummy_ <- ifelse(is.na(ds2_all$MiscFeature), 0, 1) #binary
ds2_all$MiscFeature_ <- ifelse(is.na(ds2_all$MiscFeature), 0, ds2_all$MiscFeature)


  # CentralAir
ds2_all <- ds2_all %>%
  mutate(
    CentralAir_ = as.factor(case_when(
      CentralAir == "Y"  ~ "Yes",
      CentralAir == "N" ~ "No"
  ))
)

Criação de fatores

# Transformando as variáveis numericas que são fatores em fatores.

# tipo de moradia
ds2_all$MSSubClass_ <- as.factor(ds2_all$MSSubClass)


### Likert 10
likert_10levels <- factor(c(1:10), ordered = TRUE)

# escala de Likert - OverallQual
ds2_all$OverallQual_ <- factor(ds2_all$OverallQual, levels = likert_10levels, ordered = TRUE)

# escala de Likert - OverallCond
ds2_all$OverallCond_ <- factor(ds2_all$OverallCond, levels = likert_10levels, ordered = TRUE)


### Likert 5
likert_5levels_a <- factor(c(1:5), labels = c("Po", "Fa", "TA", "Gd", "Ex") , ordered = TRUE)
likert_5levels_b <- factor(c(0:4), labels = c("0", "No", "Mn", "Av", "Gd") , ordered = TRUE)

# escala de Likert A - ExterQual
ds2_all$ExterQual_ <- factor(ds2_all$ExterQual, levels = likert_5levels_a, ordered = TRUE)

# escala de Likert A - ExterCond
ds2_all$ExterCond_ <- factor(ds2_all$ExterCond, levels = likert_5levels_a, ordered = TRUE)

# escala de Likert B - BsmtExposure_
ds2_all$BsmtExposure_ <- factor(ds2_all$BsmtExposure_, levels = likert_5levels_b, ordered = TRUE)

# escala de Likert A - HeatingQC
ds2_all$HeatingQC_ <- factor(ds2_all$HeatingQC, levels = likert_5levels_a, ordered = TRUE)

# escala de Likert A - KitchenQual
ds2_all$KitchenQual_ <- factor(ds2_all$KitchenQual, levels = likert_5levels_a, ordered = TRUE)


### Likert 6
likert_6levels <- factor(c(0:5), labels = c("0","Po", "Fa", "TA", "Gd", "Ex") , ordered = TRUE)

# escala de Likert - BsmtQual_
ds2_all$BsmtQual_ <- factor(ds2_all$BsmtQual_, levels = likert_6levels, ordered = TRUE)

# escala de Likert - BsmtCond_
ds2_all$BsmtCond_ <- factor(ds2_all$BsmtCond_, levels = likert_6levels, ordered = TRUE)

# escala de Likert - FireplaceQu_
ds2_all$FireplaceQu_ <- factor(ds2_all$FireplaceQu_, levels = likert_6levels, ordered = TRUE)

# escala de Likert - GarageQual_
ds2_all$GarageQual_ <- factor(ds2_all$GarageQual_, levels = likert_6levels, ordered = TRUE)

# escala de Likert - GarageCond_
ds2_all$GarageCond_ <- factor(ds2_all$GarageCond_, levels = likert_6levels, ordered = TRUE)

# escala de Likert - PoolQC_
ds2_all$PoolQC_ <- factor(ds2_all$PoolQC_, levels = likert_6levels, ordered = TRUE)


### Likert 7
likert_7levels <- factor(c(0:6), labels = c("0", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ") , ordered = TRUE)

# escala de Likert - BsmtFinType1_
ds2_all$BsmtFinType1_ <- factor(ds2_all$BsmtFinType1_, levels = likert_7levels, ordered = TRUE)

# escala de Likert - BsmtFinType2_
ds2_all$BsmtFinType2_ <- factor(ds2_all$BsmtFinType2_, levels = likert_7levels, ordered = TRUE)

Summary

Após feitas as mudanças, podemos ver que o número de valores faltantes reais nas variáveis categoricas diminutiam drasticamente quando comparado com as variáveis originais.

#h_train1_summary1 <- as.data.frame(which( colSums( is.na(ds2_all[ds2_all$table == "train",]) ) > 0 ))
#colnames(h_train1_summary1) <- c("Missing Observations")


h_train1_summary1 <- as.data.frame(t(ds2_all[ds2_all$table == "train",] %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))))
colnames(h_train1_summary1) <- c("Missing Observations")

h_train1_summary1 %>% arrange(rownames(h_train1_summary1)) %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Missing Observations
1stFlrSF 0
2ndFlrSF 0
3SsnPorch 0
Alley 1369
Alley_ 0
BedroomAbvGr 0
BldgType 0
BsmtCond 37
BsmtCond_ 0
BsmtDummy_ 0
BsmtExposure 38
BsmtExposure_ 1
BsmtFinSF1 0
BsmtFinSF2 0
BsmtFinType1 37
BsmtFinType1_ 0
BsmtFinType2 38
BsmtFinType2_ 1
BsmtFullBath 0
BsmtHalfBath 0
BsmtQual 37
BsmtQual_ 0
BsmtUnfSF 0
CentralAir 0
CentralAir_ 0
Condition1 0
Condition2 0
Electrical 1
EnclosedPorch 0
ExterCond 0
ExterCond_ 0
Exterior1st 0
Exterior2nd 0
ExterQual 0
ExterQual_ 0
Fence 1179
Fence_ 0
FenceDummy_ 0
FirePlaceDummy_ 0
FireplaceQu 690
FireplaceQu_ 0
Fireplaces 0
Foundation 0
FullBath 0
Functional 0
GarageArea 0
GarageCars 0
GarageCond 81
GarageCond_ 0
GarageDummy_ 0
GarageFinish 81
GarageFinish_ 0
GarageQual 81
GarageQual_ 0
GarageType 81
GarageType_ 0
GarageYrBlt 81
GrLivArea 0
HalfBath 0
Heating 0
HeatingQC 0
HeatingQC_ 0
HouseStyle 0
Id 0
KitchenAbvGr 0
KitchenQual 0
KitchenQual_ 0
LandContour 0
LandSlope 0
LotArea 0
LotConfig 0
LotFrontage 259
LotShape 0
LowQualFinSF 0
MasVnrArea 8
MasVnrDummy_ 8
MasVnrType 8
MasVnrType_ 8
MiscFeature 1406
MiscFeature_ 0
MiscFeatureDummy_ 0
MiscVal 0
MoSold 0
MSSubClass 0
MSSubClass_ 0
MSZoning 0
Neighborhood 0
OpenPorchSF 0
OverallCond 0
OverallCond_ 0
OverallQual 0
OverallQual_ 0
PavedDrive 0
PoolArea 0
PoolDummy_ 0
PoolQC 1453
PoolQC_ 0
RoofMatl 0
RoofStyle 0
SaleCondition 0
SalePrice 0
SaleType 0
ScreenPorch 0
Street 0
table 0
TotalBsmtSF 0
TotRmsAbvGrd 0
Utilities 0
WoodDeckSF 0
YearBuilt 0
YearRemodAdd 0
YrSold 0

Resumo dos Dados

Variável Dependente

Sabemos que nossa variável dependente SalePrice é quantitativa contínua, então podemos fazeruma análise inicial dela.

# Histograma
h_train1_chart_resp_1 <- ggplot(ds2_all[ds2_all$table == "train",] , aes(x = SalePrice)) +
  geom_histogram(color = "black",
                 fill = "lightblue",
                 bins = 20) +
  ggtitle("Histograma do Preço de Venda da casa") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("Preço Venda") +
  ylab("Frequência")

#ggplotly(h_train1_chart_resp_1)


# Histograma e Densidade
h_train1_chart_resp_2 <- ggplot(ds2_all[ds2_all$table == "train",] , aes(x = SalePrice)) +
  geom_histogram(color = "black",
                 fill = "lightblue",
                 bins = 20,
                 aes(y = (..count..) / sum(..count..))) +
  geom_density(col = 2, size = 1, aes(y = 20 * (..count..) /  sum(..count..))) +
  ggtitle("Histograma e curva de densidade do Preço de Venda da casa") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("Preço Venda") +
  ylab("Frequência relativa")

#ggplotly(h_train1_chart_resp_2)

# QQPlot
h_train1_chart_resp_3 <- ggplot(ds2_all[ds2_all$table == "train",], aes(sample = SalePrice)) +
  stat_qq_point(size = 2, color = "lightblue", pch=19) +
  stat_qq_line(color="red", size=0.5) +
  ggtitle("Normal Q-Q Plot do Preço de Venda da casa") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("Theoretical") + ylab("Sample")

#ggplotly(h_train1_chart_resp_3)

grid.arrange(h_train1_chart_resp_1,
             h_train1_chart_resp_2,
             h_train1_chart_resp_3,
             nrow = 2,
             ncol = 2)

Visualmente, temos que essa variável é assimétrica à direita (positiva).

# verificando o nivel de assimetria
skewness(ds2_all[ds2_all$table == "train",]$SalePrice)
## [1] 1.879009

O cálculo feito acima comprova que de fato a variável é assimétrica à direita.

# verificando a normalidade da variável
shapiro.test(ds2_all[ds2_all$table == "train",]$SalePrice)
## 
##  Shapiro-Wilk normality test
## 
## data:  ds2_all[ds2_all$table == "train", ]$SalePrice
## W = 0.86967, p-value < 0.00000000000000022

Acima fizemos o Teste Shapiro-Wilk para Normalidade dessa variável. As hipóteses do teste são:

\[ \begin{cases} H_0: \text{ A amostra veio de uma população normalmente distribuída} \\ H_1: \text{ A amostra não veio de uma população normalmente distribuída} \end{cases} \]

e como p-valor < 5% o teste rejeita \(H_0\), indicando que a amostra de preços de venda não veio de uma população normal. Isso nos causar problemas mais a frente quando formos definir o tipo de modelagem que venhamos a fazer, já que para a regressão linear, um dos pressupostos é que as variáveis sejam normais.
Um jeito de resolver esse problema seria usar o log desse valor.

ds2_all$SalePrice_log <- log(ds2_all$SalePrice)
# Histograma
h_train1_chart_resp_4 <- ggplot(ds2_all[ds2_all$table == "train",] , aes(x = SalePrice_log)) +
  geom_histogram(color = "black",
                 fill = "lightblue",
                 bins = 20) +
  ggtitle("Histograma do log Preço de Venda da casa") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("log Preço Venda") +
  ylab("Frequência")

#ggplotly(h_train1_chart_resp_4)


# Histograma e Densidade
h_train1_chart_resp_5 <- ggplot(ds2_all[ds2_all$table == "train",] , aes(x = SalePrice_log)) +
  geom_histogram(color = "black",
                 fill = "lightblue",
                 bins = 20,
                 aes(y = (..count..) / sum(..count..))) +
  geom_density(col = 2, size = 1, aes(y = 20 * (..count..) /  sum(..count..))) +
  ggtitle("Histograma e curva de densidade do log Preço de Venda da casa") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("log Preço Venda") +
  ylab("Frequência relativa")

#ggplotly(h_train1_chart_resp_5)

# QQPlot
h_train1_chart_resp_6 <- ggplot(ds2_all[ds2_all$table == "train",], aes(sample = SalePrice_log)) +
  stat_qq_point(size = 2, color = "lightblue", pch=19) +
  stat_qq_line(color="red", size=0.5) +
  ggtitle("Normal Q-Q Plot do log Preço de Venda da casa") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("Theoretical") + ylab("Sample")

#ggplotly(h_train1_chart_resp_6)

grid.arrange(h_train1_chart_resp_4,
             h_train1_chart_resp_5,
             h_train1_chart_resp_6,
             nrow = 2,
             ncol = 2)

# verificando o nivel de assimetria
skewness(ds2_all[ds2_all$table == "train",]$SalePrice_log)
## [1] 0.1210859
# verificando a normalidade da variável
shapiro.test(ds2_all[ds2_all$table == "train",]$SalePrice_log)
## 
##  Shapiro-Wilk normality test
## 
## data:  ds2_all[ds2_all$table == "train", ]$SalePrice_log
## W = 0.99121, p-value = 0.0000001149

O teste de Shapiro-Wilk continua rejeitando o log da variável. Mais a frente vamos verificar novamente após retirarmos outliers, pois podem ser esses valores extremos mexendo com a normalidade da variável resposta. Foi possivel notar que o log do preço tem um p-valor maior do que somente o preço, indicando que usar o log aqui é realmente o caminho certo.

Abaixo temos algumas das estatísticas descritivas das variáveis numéricas do banco de dados. Não necessariamente são variáveis quantitavivas. O objetivo aqui é ver se há valores descrepantes dos demais.

Train

h_train1_summary2 <- ds2_all[ds2_all$table == "train",] %>% 
  summarise(across(
    .cols = where(is.numeric),
    .fns = list(
      ~ sum(is.na(.x)),
      min = min,
      median = median, 
      mean = mean, 
      max = max, 
      sd = sd
    ), 
    na.rm = TRUE,
    .names = "{.col}_{.fn}")
    ) %>% 
  pivot_longer(cols = everything()) %>% 
  mutate(statistic = str_match(name, pattern = ".+_(.+)")[,2],
         name = str_match(name, pattern = "(.+)_.+")[,2]) %>% 
  pivot_wider(names_from = name, values_from = value)

h_train1_summary2 %>% kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
statistic Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice BsmtDummy_ MasVnrDummy_ FirePlaceDummy_ GarageDummy_ PoolDummy_ FenceDummy_ MiscFeatureDummy_ SalePrice_log
1 0.00 0.00000 259.00000 0.000 0.000000 0.000000 0.0000 0.00000 8.0000 0.0000 0.00000 0.0000 0.0000 0.0000 0.0000 0.00000 0.0000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000 0.0000000 81.00000 0.000000 0.0000 0.00000 0.00000 0.00000 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000 0.0 0.0000000 8.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
min 1.00 20.00000 21.00000 1300.000 1.000000 1.000000 1872.0000 1950.00000 0.0000 0.0000 0.00000 0.0000 0.0000 334.0000 0.0000 0.00000 334.0000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 2.000000 0.0000000 1900.00000 0.000000 0.0000 0.00000 0.00000 0.00000 0.000000 0.00000 0.000000 0.00000 1.000000 2006.000000 34900.0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 10.4602421
median 730.50 50.00000 69.00000 9478.500 6.000000 5.000000 1973.0000 1994.00000 0.0000 383.5000 0.00000 477.5000 991.5000 1087.0000 0.0000 0.00000 1464.0000 0.0000000 0.0000000 2.0000000 0.0000000 3.000000 1.0000000 6.000000 1.0000000 1980.00000 2.000000 480.0000 0.00000 25.00000 0.00000 0.000000 0.00000 0.000000 0.00000 6.000000 2008.000000 163000.0 1.0000000 0.0000000 1.0000000 1.0000000 0.0000000 0.0000000 0.0000000 12.0015055
mean 730.50 56.89726 70.04996 10516.828 6.099315 5.575342 1971.2678 1984.86575 103.6853 443.6397 46.54932 567.2404 1057.4295 1162.6267 346.9925 5.84452 1515.4637 0.4253425 0.0575342 1.5650685 0.3828767 2.866438 1.0465753 6.517808 0.6130137 1978.50616 1.767123 472.9801 94.24452 46.66027 21.95411 3.409589 15.06096 2.758904 43.48904 6.321918 2007.815753 180921.2 0.9746575 0.4084022 0.5273973 0.9445205 0.0047945 0.1924658 0.0369863 12.0240509
max 1460.00 190.00000 313.00000 215245.000 10.000000 9.000000 2010.0000 2010.00000 1600.0000 5644.0000 1474.00000 2336.0000 6110.0000 4692.0000 2065.0000 572.00000 5642.0000 3.0000000 2.0000000 3.0000000 2.0000000 8.000000 3.0000000 14.000000 3.0000000 2010.00000 4.000000 1418.0000 857.00000 547.00000 552.00000 508.000000 480.00000 738.000000 15500.00000 12.000000 2010.000000 755000.0 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 13.5344730
sd 421.61 42.30057 24.28475 9981.265 1.382996 1.112799 30.2029 20.64541 181.0662 456.0981 161.31927 441.8670 438.7053 386.5877 436.5284 48.62308 525.4804 0.5189106 0.2387526 0.5509158 0.5028854 0.815778 0.2203382 1.625393 0.6446664 24.68972 0.747315 213.8048 125.33879 66.25603 61.11915 29.317331 55.75742 40.177307 496.12302 2.703626 1.328095 79442.5 0.1572169 0.4917076 0.4994199 0.2289921 0.0691000 0.3943719 0.1887928 0.3994519

Obs: Na tabela acima a linha statistic = 1 representa a quantidade de valores faltantes no banco de treino.

Note que de todas as variáveis numéricas do banco,
- MasVnrArea: contém 8 valores faltantes dos 1460 valores totais da base de treino, representando 0.5 %.
- LotFrontage: contém 259 valores faltantes dos 1460 valores totais da base de treino, representando 17.7 %.

Test

h_test1_summary2 <- ds2_all[ds2_all$table == "test",] %>% 
  summarise(across(
    .cols = where(is.numeric),
    .fns = list(
      ~ sum(is.na(.x)),
      min = min,
      median = median, 
      mean = mean, 
      max = max, 
      sd = sd
    ), 
    na.rm = TRUE,
    .names = "{.col}_{.fn}")
    ) %>% 
  pivot_longer(cols = everything()) %>% 
  mutate(statistic = str_match(name, pattern = ".+_(.+)")[,2],
         name = str_match(name, pattern = "(.+)_.+")[,2]) %>% 
  pivot_wider(names_from = name, values_from = value)

h_test1_summary2 %>% 
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
statistic Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice BsmtDummy_ MasVnrDummy_ FirePlaceDummy_ GarageDummy_ PoolDummy_ FenceDummy_ MiscFeatureDummy_ SalePrice_log
1 0.0000 0.00000 227.00000 0.000 0.000000 0.000000 0.00000 0.00000 15.0000 1.0000 1.00000 1.0000 1.0000 0.0000 0.0000 0.000000 0.0000 2.0000000 2.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 78.00000 1.0000000 1.0000 0.00000 0.00000 0.00000 0.00000 0.00000 0.000000 0.00000 0.000000 0.00000 0.0 1.0000000 15.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
min 1461.0000 20.00000 21.00000 1470.000 1.000000 1.000000 1879.00000 1950.00000 0.0000 0.0000 0.00000 0.0000 0.0000 407.0000 0.0000 0.000000 407.0000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 3.000000 0.0000000 1895.00000 0.0000000 0.0000 0.00000 0.00000 0.00000 0.00000 0.00000 0.000000 0.00000 1.000000 2006.00000 135751.3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 11.8185800
median 2190.0000 50.00000 67.00000 9399.000 6.000000 5.000000 1973.00000 1992.00000 0.0000 350.5000 0.00000 460.0000 988.0000 1079.0000 0.0000 0.000000 1432.0000 0.0000000 0.0000000 2.0000000 0.0000000 3.0000000 1.0000000 6.000000 0.0000000 1979.00000 2.0000000 480.0000 0.00000 28.00000 0.00000 0.00000 0.00000 0.000000 0.00000 6.000000 2008.00000 179208.7 1.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 12.0963061
mean 2190.0000 57.37834 68.58036 9819.161 6.078821 5.553804 1971.35778 1983.66278 100.7091 439.2037 52.61934 554.2949 1046.1180 1156.5346 325.9678 3.543523 1486.0459 0.4344544 0.0652025 1.5709390 0.3776559 2.8540096 1.0424949 6.385195 0.5812200 1977.72122 1.7661180 472.7689 93.17478 48.31391 24.24332 1.79438 17.06443 1.744345 58.16792 6.104181 2007.76971 179183.9 0.9718793 0.3933518 0.4996573 0.9479095 0.0041124 0.1987663 0.0349554 12.0920649
max 2919.0000 190.00000 200.00000 56600.000 10.000000 9.000000 2010.00000 2010.00000 1290.0000 4010.0000 1526.00000 2140.0000 5095.0000 5095.0000 1862.0000 1064.000000 5095.0000 3.0000000 2.0000000 4.0000000 2.0000000 6.0000000 2.0000000 15.000000 4.0000000 2207.00000 5.0000000 1488.0000 1424.00000 742.00000 1012.00000 360.00000 576.00000 800.000000 17000.00000 12.000000 2010.00000 281644.0 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 12.5483991
sd 421.3213 42.74688 22.37684 4955.517 1.436812 1.113740 30.39007 21.13047 177.6259 455.2680 176.75393 437.2605 442.8986 398.1658 420.6102 44.043251 485.5661 0.5306475 0.2524683 0.5551899 0.5030167 0.8297884 0.2084717 1.508895 0.6474205 26.43117 0.7759451 217.0486 127.74488 68.88336 67.22777 20.20784 56.60976 30.491646 630.80698 2.722432 1.30174 16518.3 0.1653744 0.4886630 0.5001713 0.2222857 0.0640180 0.3992086 0.1837300 0.0900529

Obs: Na tabela acima a linha statistic = 1 representa a quantidade de valores faltantes no banco de test.

Note que de todas as variáveis numéricas do banco,
- MasVnrArea: contém 15 valores faltantes dos 1459 valores totais da base de treino, representando 1 %.
- LotFrontage: contém 227 valores faltantes dos 1459 valores totais da base de treino, representando 15.6 %.
- BsmtFinSF1: contém 1 valores faltantes dos 1459 valores totais da base de treino, representando 0.1 %.
- BsmtFinSF2: contém 1 valores faltantes dos 1459 valores totais da base de treino, representando 0.1 %.
- BsmtUnfSF: contém 1 valores faltantes dos 1459 valores totais da base de treino, representando 0.1 %.
- TotalBsmtSF: contém 1 valores faltantes dos 1459 valores totais da base de treino, representando 0.1 %.
- BsmtFullBath: contém 2 valores faltantes dos 1459 valores totais da base de treino, representando 0.1 %.
- BsmtHalfBath: contém 2 valores faltantes dos 1459 valores totais da base de treino, representando 0.1 %.
- GarageYrBlt: contém 78 valores faltantes dos 1459 valores totais da base de treino, representando 5.3 %.
- GarageCars: contém 1 valores faltantes dos 1459 valores totais da base de treino, representando 0.1 %.
- GarageArea: contém 1 valores faltantes dos 1459 valores totais da base de treino, representando 0.1 %.
- BsmtDummy_: contém 1 valores faltantes dos 1459 valores totais da base de treino, representando 0.1 %.
- MasVnrDummy_: contém 15 valores faltantes dos 1459 valores totais da base de treino, representando 1 %.


Note que a base de teste tem valores faltantes em mais variáveis diferentes do que a base de treino, mas ainda sim, a que mais chama atenção é LotFrontage.
Como a variável LotFrontage tem muitos valores faltantes, a simplesmente remoção não seria uma boa alternativa. Podemos pensar em fazer uma regressão simples nessa variável, se encontrarmos uma que seja correlacionada com ela e ajustar esses valores faltantes.

Regressão para a variável LotFrontage

# Histograma
h_train1_chart_LF_1 <- ggplot(ds2_all[ds2_all$table == "train",] , aes(x = LotFrontage)) +
  geom_histogram(color = "black",
                 fill = "lightblue",
                 bins = 20) +
  ggtitle("Histograma da variável LotFrontage") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("LotFrontage (ft²)") +
  ylab("Frequência")

#ggplotly(h_train1_chart_LF_1)


# Histograma e Densidade
h_train1_chart_LF_2 <- ggplot(ds2_all[ds2_all$table == "train",] , aes(x = LotFrontage)) +
  geom_histogram(color = "black",
                 fill = "lightblue",
                 bins = 20,
                 aes(y = (..count..) / sum(..count..))) +
  geom_density(col = 2, size = 1, aes(y = 20 * (..count..) /  sum(..count..))) +
  ggtitle("Histograma e curva de densidade da variável LotFrontage") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("LotFrontage (ft²)") +
  ylab("Frequência relativa")

#ggplotly(h_train1_chart_LF_2)

# QQPlot
h_train1_chart_LF_3 <- ggplot(ds2_all[ds2_all$table == "train",], aes(sample = LotFrontage)) +
  stat_qq_point(size = 2, color = "lightblue", pch=19) +
  stat_qq_line(color="red", size=0.5) +
  ggtitle("Normal Q-Q Plot da variável LotFrontage") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("Theoretical") + ylab("Sample")

#ggplotly(h_train1_chart_LF_3)

#qqnorm(ds2_all[ds2_all$table == "train",]$LotFrontage)

grid.arrange(h_train1_chart_LF_1,
             h_train1_chart_LF_2,
             h_train1_chart_LF_3,
             nrow = 2,
             ncol = 2)

De maneira similar, temos que LotFrontage é também assimétrica à direita (positiva), e como vimos, o logaritmo pode resolver esse problema.

ds2_all$LotFrontage_log <- log(ds2_all$LotFrontage)
ds2_all$LotArea_log <- log(ds2_all$LotArea)
# Histograma
h_train1_chart_LF_4 <- ggplot(ds2_all[ds2_all$table == "train",] , aes(x = LotFrontage_log)) +
  geom_histogram(color = "black",
                 fill = "lightblue",
                 bins = 20) +
  ggtitle("Histograma da variável LotFrontage_log") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("log(LotFrontage (ft²))") +
  ylab("Frequência")

#ggplotly(h_train1_chart_LF_4)


# Histograma e Densidade
h_train1_chart_LF_5 <- ggplot(ds2_all[ds2_all$table == "train",] , aes(x = LotFrontage_log)) +
  geom_histogram(color = "black",
                 fill = "lightblue",
                 bins = 20,
                 aes(y = (..count..) / sum(..count..))) +
  geom_density(col = 2, size = 1, aes(y = 20 * (..count..) /  sum(..count..))) +
  ggtitle("Histograma e curva de densidade da variável LotFrontage_log") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("log(LotFrontage (ft²))") +
  ylab("Frequência relativa")

#ggplotly(h_train1_chart_LF_5)

# QQPlot
h_train1_chart_LF_6 <- ggplot(ds2_all[ds2_all$table == "train",], aes(sample = LotFrontage_log)) +
  stat_qq_point(size = 2, color = "lightblue", pch=19) +
  stat_qq_line(color="red", size=0.5) +
  ggtitle("Normal Q-Q Plot da variável LotFrontage_log") +
  theme(plot.title = element_text(size = 7, face = "bold")) +
  xlab("Theoretical") + ylab("Sample")

#ggplotly(h_train1_chart_LF_6)

grid.arrange(h_train1_chart_LF_4,
             h_train1_chart_LF_5,
             h_train1_chart_LF_6,
             nrow = 2,
             ncol = 2)

Visualmente, após a apligação da função logarítmica, temos que a distribuição da variável LotFrontage_log se assemelha mais à Normal.

Relação entre as variáveis

#pairs(ds1_train1_final)

ggpairs(ds2_all[, -c(13, 25, 106)], 
        ggplot2::aes(colour = table, alpha = 0.5),
        upper = list(continuous = wrap("cor", size=2, binwidth=0.5)),
        lower=list(combo=wrap("facethist", binwidth=0.8))
        ) + theme_bw()
# num_cols <- unlist(lapply(ds2_all, is.numeric)) # Identify numeric columns (R BASE)
# h_train_cormat_1 <- ds2_all[ , num_cols]  


h_train_cormat_1 <- select_if(ds2_all, is.numeric) # Subset numeric columns (dplyr)
h_train_cormat_2 <- round( x = cor(na.omit(h_train_cormat_1)), digits=2)
#head(h_train_cormat_2)
h_train_cormat_3 <- melt(h_train_cormat_2)  # heatmap (reshape2)
#head(h_train_cormat_3)

h_train_cormat_4 <- ggplot(data = h_train_cormat_3, 
            aes(x=Var1, y=Var2, fill = value)) +
  geom_tile() +
  theme(axis.text.x=element_text(color = "black", size=7, angle=30, vjust=.8, hjust=0.8),
        axis.text.y=element_text(color = "black", size=7, angle=0, vjust=.8, hjust=0.8)
        ) +
  ggtitle("Correlation Heatmap")

ggplotly(h_train_cormat_4)

Separando somente os que tem correlação acima de 0.5 ou abaixo -0.5.

h_train_cormat_5 <- ggplot(data = h_train_cormat_3[h_train_cormat_3$value > 0.5 | 
                                      h_train_cormat_3$value < -0.5,], 
            aes(x=Var1, y=Var2, fill = value)) +
  geom_tile() +
  theme(axis.text.x=element_text(color = "black", size=7, angle=30, vjust=.8, hjust=0.8),
        axis.text.y=element_text(color = "black", size=7, angle=0, vjust=.8, hjust=0.8)
        ) +
  ggtitle("Correlation Heatmap")

ggplotly(h_train_cormat_5)

Abaixo temos a correlação em ordem decrescente de LotFrontage e as demais variáveis numéricas. Note que LotArea é a que mais se correlaciona com ela, com uma correlação de 0.49.

h_train_cormat_3a <- h_train_cormat_3[h_train_cormat_3$Var1 == "LotFrontage" ,] 

h_train_cormat_3a[order(h_train_cormat_3a$value, decreasing = TRUE),] %>%
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Var1 Var2 value
99 LotFrontage LotFrontage 1.00
2211 LotFrontage LotFrontage_log 0.94
2259 LotFrontage LotArea_log 0.69
147 LotFrontage LotArea 0.49
627 LotFrontage 1stFlrSF 0.45
771 LotFrontage GrLivArea 0.38
1299 LotFrontage GarageArea 0.38
579 LotFrontage TotalBsmtSF 0.35
1107 LotFrontage TotRmsAbvGrd 0.35
2163 LotFrontage SalePrice_log 0.33
1251 LotFrontage GarageCars 0.32
1779 LotFrontage SalePrice 0.31
1155 LotFrontage Fireplaces 0.25
1011 LotFrontage BedroomAbvGr 0.24
387 LotFrontage MasVnrArea 0.22
435 LotFrontage BsmtFinSF1 0.22
1923 LotFrontage FirePlaceDummy_ 0.22
195 LotFrontage OverallQual 0.21
1587 LotFrontage PoolArea 0.18
2019 LotFrontage PoolDummy_ 0.18
915 LotFrontage FullBath 0.17
1395 LotFrontage OpenPorchSF 0.17
1875 LotFrontage MasVnrDummy_ 0.17
819 LotFrontage BsmtFullBath 0.12
1347 LotFrontage WoodDeckSF 0.12
291 LotFrontage YearBuilt 0.11
531 LotFrontage BsmtUnfSF 0.11
339 LotFrontage YearRemodAdd 0.09
1203 LotFrontage GarageYrBlt 0.08
1539 LotFrontage ScreenPorch 0.07
1635 LotFrontage MiscVal 0.05
483 LotFrontage BsmtFinSF2 0.04
963 LotFrontage HalfBath 0.04
675 LotFrontage 2ndFlrSF 0.03
1491 LotFrontage 3SsnPorch 0.03
1443 LotFrontage EnclosedPorch 0.02
2115 LotFrontage MiscFeatureDummy_ 0.02
1683 LotFrontage MoSold 0.01
1059 LotFrontage KitchenAbvGr 0.00
2067 LotFrontage FenceDummy_ 0.00
723 LotFrontage LowQualFinSF -0.01
1731 LotFrontage YrSold -0.01
867 LotFrontage BsmtHalfBath -0.02
3 LotFrontage Id -0.03
1827 LotFrontage BsmtDummy_ -0.03
243 LotFrontage OverallCond -0.07
51 LotFrontage MSSubClass -0.42
1971 LotFrontage GarageDummy_ NA

Abaixo temos a correlação em ordem decrescente de LotFrontage_log e as demais variáveis numéricas. Note que LotArea_log é a que mais se correlaciona com ela, com uma correlação de 0.77.

h_train_cormat_3b <- h_train_cormat_3[h_train_cormat_3$Var1 == "LotFrontage_log" ,] 

h_train_cormat_3b[order(h_train_cormat_3b$value, decreasing = TRUE),] %>%
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Var1 Var2 value
2255 LotFrontage_log LotFrontage_log 1.00
143 LotFrontage_log LotFrontage 0.94
2303 LotFrontage_log LotArea_log 0.77
191 LotFrontage_log LotArea 0.45
671 LotFrontage_log 1stFlrSF 0.42
1151 LotFrontage_log TotRmsAbvGrd 0.35
1343 LotFrontage_log GarageArea 0.35
815 LotFrontage_log GrLivArea 0.34
623 LotFrontage_log TotalBsmtSF 0.33
2207 LotFrontage_log SalePrice_log 0.32
1823 LotFrontage_log SalePrice 0.30
1295 LotFrontage_log GarageCars 0.29
1055 LotFrontage_log BedroomAbvGr 0.27
1199 LotFrontage_log Fireplaces 0.23
1967 LotFrontage_log FirePlaceDummy_ 0.21
479 LotFrontage_log BsmtFinSF1 0.18
239 LotFrontage_log OverallQual 0.17
431 LotFrontage_log MasVnrArea 0.16
959 LotFrontage_log FullBath 0.15
1439 LotFrontage_log OpenPorchSF 0.15
575 LotFrontage_log BsmtUnfSF 0.13
1919 LotFrontage_log MasVnrDummy_ 0.13
1391 LotFrontage_log WoodDeckSF 0.11
1631 LotFrontage_log PoolArea 0.11
2063 LotFrontage_log PoolDummy_ 0.11
863 LotFrontage_log BsmtFullBath 0.09
1583 LotFrontage_log ScreenPorch 0.08
335 LotFrontage_log YearBuilt 0.07
383 LotFrontage_log YearRemodAdd 0.06
1247 LotFrontage_log GarageYrBlt 0.05
1679 LotFrontage_log MiscVal 0.05
527 LotFrontage_log BsmtFinSF2 0.04
2159 LotFrontage_log MiscFeatureDummy_ 0.04
1103 LotFrontage_log KitchenAbvGr 0.03
1487 LotFrontage_log EnclosedPorch 0.03
1535 LotFrontage_log 3SsnPorch 0.03
2111 LotFrontage_log FenceDummy_ 0.03
1727 LotFrontage_log MoSold 0.01
719 LotFrontage_log 2ndFlrSF 0.00
767 LotFrontage_log LowQualFinSF -0.01
1007 LotFrontage_log HalfBath -0.01
1775 LotFrontage_log YrSold -0.01
911 LotFrontage_log BsmtHalfBath -0.02
47 LotFrontage_log Id -0.03
287 LotFrontage_log OverallCond -0.04
1871 LotFrontage_log BsmtDummy_ -0.04
95 LotFrontage_log MSSubClass -0.55
2015 LotFrontage_log GarageDummy_ NA

Note que a variável LotArea é a que tem correlação mais alta. Vale ressaltar que também observamos os maiores valores negativos e não somente os positivos.

#plot(ds2_all$LotArea, ds2_all$LotFrontage)
#hist(ds2_all$LotArea)

h_train1_chart_LF_7 = ggplot(data = ds2_all[ds2_all$table == "train" & !is.na(ds2_all$LotArea_log),], 
                             aes(x = LotArea_log, y = LotFrontage_log))+
  geom_point(color="lightblue")+
  geom_smooth(method = "lm", se = TRUE, color='black')+
  ggtitle("Gráfico de Dispersão") +
  xlab("LotAre_log") + ylab("LotForntage_log")
  
ggplotly(h_train1_chart_LF_7)

A partir do gráfico acima, pode-se dizer que uma regressão linear simples entre as duas variáveis é possível. Vamos então verificar o ajuste desse modelo, ajustar os valores e inserí-los na base principal quando esse valor for faltante.

Para essa regressão, vou considerar a base toda (tanto train quanto test) e fazer uma subdivisão de treino e teste aleatória, para verificar se o modelo é de fato bem preditivo.

ds2_all_LF <- ds2_all[, c("Id", "LotFrontage_log", "LotArea_log")]
ds2_all_LF_rm <- na.omit(ds2_all_LF)

set.seed(123)
index_LF <- createDataPartition(ds2_all_LF_rm$LotFrontage_log, 
                                p = 0.7, 
                                list = FALSE)
ds2_LF_train <- ds2_all_LF_rm[index_LF, ]
ds2_LF_test  <- ds2_all_LF_rm[-index_LF, ]

Modelo 1: com intercepto

h_LF_model1 <- lm( LotFrontage_log ~ LotArea_log , 
                   data = ds2_LF_train)
summary(h_LF_model1)
## 
## Call:
## lm(formula = LotFrontage_log ~ LotArea_log, data = ds2_LF_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3118 -0.1255  0.0280  0.1439  1.2517 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -0.71076    0.10743  -6.616      0.0000000000492 ***
## LotArea_log  0.53940    0.01183  45.607 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2413 on 1704 degrees of freedom
## Multiple R-squared:  0.5497, Adjusted R-squared:  0.5494 
## F-statistic:  2080 on 1 and 1704 DF,  p-value: < 0.00000000000000022
#par(mfrow=c(2,2))
#plot(h_LF_model1)
autoplot(h_LF_model1)

Modelo 2: sem intercepto

h_LF_model2 <- lm( LotFrontage_log ~ LotArea_log - 1 , 
                   data = ds2_LF_train)
summary(h_LF_model2)
## 
## Call:
## lm(formula = LotFrontage_log ~ LotArea_log - 1, data = ds2_LF_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20400 -0.12192  0.02706  0.14598  1.19128 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## LotArea_log 0.4612642  0.0006511   708.5 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2443 on 1705 degrees of freedom
## Multiple R-squared:  0.9966, Adjusted R-squared:  0.9966 
## F-statistic: 5.019e+05 on 1 and 1705 DF,  p-value: < 0.00000000000000022
#par(mfrow=c(2,2))
#plot(h_LF_model2)
autoplot(h_LF_model2)

O Modelo 2 (sem intercepto) apresenta um R² de 99.7%.

anova(h_LF_model1, h_LF_model2)
## Analysis of Variance Table
## 
## Model 1: LotFrontage_log ~ LotArea_log
## Model 2: LotFrontage_log ~ LotArea_log - 1
##   Res.Df    RSS Df Sum of Sq      F           Pr(>F)    
## 1   1704  99.18                                         
## 2   1705 101.73 -1   -2.5479 43.775 0.00000000004919 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analisando a ANOVA para os dois modelos a diferença da Soma Quadrada dos Resíduos (RSS) não aumentou muito quando retiramos o intercepto, mas o ajuste do modelo (R²) melhorou significativamente, de 54.9 % para 99.7 %.

Vamos agora fazer o cross validation dos dois modelos usando resampling.

# Specify resampling plan
cv <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  repeats = 5
)

set.seed(123)  # for reproducibility
cv_h_LF_model1 <- train(
  form = LotFrontage_log ~ LotArea_log, 
  data = ds2_LF_train, 
  method = "lm",
  trControl = cv,
  preProcess = c("center", "scale"),
  tuneGrid  = expand.grid(intercept = TRUE)
)

summary(cv_h_LF_model1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3118 -0.1255  0.0280  0.1439  1.2517 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 4.181395   0.005841  715.87 <0.0000000000000002 ***
## LotArea_log 0.266468   0.005843   45.61 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2413 on 1704 degrees of freedom
## Multiple R-squared:  0.5497, Adjusted R-squared:  0.5494 
## F-statistic:  2080 on 1 and 1704 DF,  p-value: < 0.00000000000000022
set.seed(123)  # for reproducibility
cv_h_LF_model2 <- train(
  form = LotFrontage_log ~ -1 + LotArea_log , 
  data = ds2_LF_train, 
  method = "lm",
  trControl = cv,
  preProcess = c("center", "scale"),
  tuneGrid  = expand.grid(intercept = FALSE)
)

summary(cv_h_LF_model2)
## 
## Call:
## lm(formula = .outcome ~ 0 + ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  2.870  4.056  4.209  4.325  5.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## LotArea_log   0.2665     0.1015   2.626  0.00871 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.19 on 1705 degrees of freedom
## Multiple R-squared:  0.004029,   Adjusted R-squared:  0.003445 
## F-statistic: 6.897 on 1 and 1705 DF,  p-value: 0.00871

Usando esse método de resampling, vemos que no final das contas, ambos modelos me dão erros e ajustes iguais, isso porque o cross validation nesse caso não consegue fazer a destinção de um modelo com ou sem intercepto. Mesmo usando a opção html tuneGrid = expand.grid(intercept = FALSE) mas isso só remove o incercepto depois do cálculo da regressão (com o intercepto) já ter sido feito, o que não é o que queremos.

#### Evaluation
# Extract out of sample performance measures
summary(resamples(list(
  model1 = cv_h_LF_model1, 
  model2 = cv_h_LF_model2
)))
## 
## Call:
## summary.resamples(object = resamples(list(model1 = cv_h_LF_model1, model2
##  = cv_h_LF_model2)))
## 
## Models: model1, model2 
## Number of resamples: 50 
## 
## MAE 
##             Min.  1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## model1 0.1458912 0.169162 0.1768206 0.1761007 0.1827132 0.2171508    0
## model2 4.1477406 4.171669 4.1803189 4.1813083 4.1941816 4.2140968    0
## 
## RMSE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## model1 0.1967674 0.2267414 0.2397586 0.2406741 0.2546231 0.3233613    0
## model2 4.1556224 4.1776655 4.1877780 4.1882604 4.2000731 4.2193947    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## model1 0.3374527 0.4809316 0.5679558 0.5527062 0.5943837 0.7124251    0
## model2 0.3374527 0.4809316 0.5679558 0.5527062 0.5943837 0.7124251    0

Dessa maneira, vamos fazer o teste de modo simple, usando a base de teste e calculando os erros.

summary(h_LF_model1)
## 
## Call:
## lm(formula = LotFrontage_log ~ LotArea_log, data = ds2_LF_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3118 -0.1255  0.0280  0.1439  1.2517 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -0.71076    0.10743  -6.616      0.0000000000492 ***
## LotArea_log  0.53940    0.01183  45.607 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2413 on 1704 degrees of freedom
## Multiple R-squared:  0.5497, Adjusted R-squared:  0.5494 
## F-statistic:  2080 on 1 and 1704 DF,  p-value: < 0.00000000000000022
summary(h_LF_model2)
## 
## Call:
## lm(formula = LotFrontage_log ~ LotArea_log - 1, data = ds2_LF_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20400 -0.12192  0.02706  0.14598  1.19128 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## LotArea_log 0.4612642  0.0006511   708.5 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2443 on 1705 degrees of freedom
## Multiple R-squared:  0.9966, Adjusted R-squared:  0.9966 
## F-statistic: 5.019e+05 on 1 and 1705 DF,  p-value: < 0.00000000000000022
ds2_LF_train$pred_m1 <- h_LF_model1$fitted.values
ds2_LF_train$pred_m2 <- h_LF_model2$fitted.values

ds2_LF_test$pred_m1 <- h_LF_model1 %>% predict(ds2_LF_test)
ds2_LF_test$pred_m2 <- h_LF_model2 %>% predict(ds2_LF_test)
table1_train_m1 <- data.frame(R2 = R2(ds2_LF_train$pred_m1, ds2_LF_train$LotFrontage_log),
                              RMSE = RMSE(ds2_LF_train$pred_m1, ds2_LF_train$LotFrontage_log),
                              MAE = MAE(ds2_LF_train$pred_m1, ds2_LF_train$LotFrontage_log)
                              )

table1_train_m2 <- data.frame(R2 = R2(ds2_LF_train$pred_m2, ds2_LF_train$LotFrontage_log),
                              RMSE = RMSE(ds2_LF_train$pred_m2, ds2_LF_train$LotFrontage_log),
                              MAE = MAE(ds2_LF_train$pred_m2, ds2_LF_train$LotFrontage_log)
                              )

table1_test_m1 <- data.frame(R2 = R2(ds2_LF_test$pred_m1, ds2_LF_test$LotFrontage_log),
                             RMSE = RMSE(ds2_LF_test$pred_m1, ds2_LF_test$LotFrontage_log),
                             MAE = MAE(ds2_LF_test$pred_m1, ds2_LF_test$LotFrontage_log)
                             )

table1_test_m2 <- data.frame(R2 = R2(ds2_LF_test$pred_m2, ds2_LF_test$LotFrontage_log),
                             RMSE = RMSE(ds2_LF_test$pred_m2, ds2_LF_test$LotFrontage_log),
                             MAE = MAE(ds2_LF_test$pred_m2, ds2_LF_test$LotFrontage_log)
                             )

table1 <- rbind(table1_train_m1, table1_train_m2, table1_test_m1, table1_test_m2)
row.names(table1) <- c("train_model_1", "train_model_2", "test_model_1", "test_model_2")

table1 %>% kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
R2 RMSE MAE
train_model_1 0.5496813 0.2411142 0.1758202
train_model_2 0.5496813 0.2441916 0.1805571
test_model_1 0.6800449 0.2011005 0.1538040
test_model_2 0.6800449 0.2086177 0.1598457

Como podemos ver as estatisticas que comparam os modelos, não faz muita diferença entre se ter ou não o intercepto, e como é mais lógico neste caso, vou usar o modelo sem intercepto como o modelo final.

summary(h_LF_model2)
## 
## Call:
## lm(formula = LotFrontage_log ~ LotArea_log - 1, data = ds2_LF_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20400 -0.12192  0.02706  0.14598  1.19128 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## LotArea_log 0.4612642  0.0006511   708.5 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2443 on 1705 degrees of freedom
## Multiple R-squared:  0.9966, Adjusted R-squared:  0.9966 
## F-statistic: 5.019e+05 on 1 and 1705 DF,  p-value: < 0.00000000000000022
exp(h_LF_model2$coefficients)
## LotArea_log 
##    1.586078

E explicação lógica desse modelo é a distância entre a casa e a rua possui uma relação linear com o tamanho total do terreno. Pelo modelo, temos que a cada uma unidade de acrescimo na variável LrArea_log, a variável LotFrontage_log aumenta 0.4612642 . Se voltarmos ambas as variáveis para o normal (aplicandoa função exponencial), temos que cada uma unidade de acrescimo na variável LrArea_log, a variável LotFrontage aumenta 1.5860778 .

Vamos então criar a variável preditiva na base de dados original e utilizar esses valores para preencher os faltantes.

ds2_all$LotFrontage_log_pred = h_LF_model2 %>% predict(ds2_all)
ds2_all$LotFrontage_log_2 <- ifelse(is.na(ds2_all$LotFrontage_log),
                                    ds2_all$LotFrontage_log_pred,
                                    ds2_all$LotFrontage_log)


ds2_all$LotFrontage_2 <- ifelse(is.na(ds2_all$LotFrontage),
                                    exp(ds2_all$LotFrontage_log_pred),
                                    ds2_all$LotFrontage)

Vamos verificar como ficou a variável final com relação à original.

h_train1_chart_LF_8 <- ggplot(ds2_all, aes(x = LotArea, 
                                           y = LotFrontage_2, 
                                           fill = is.na(LotFrontage))) +
  geom_boxplot() +
  ggtitle("Gráfico de Box-plot") +
  xlab("LotArea (ft²)") +
  ylab("LotFrontage_2 (ft²)")

h_train1_chart_LF_8

Resumo da variável que acabamos de predizer para a base de treino:

summary(ds2_all[ds2_all$table == "train" , c("LotFrontage", "LotFrontage_2")])
##   LotFrontage     LotFrontage_2   
##  Min.   : 21.00   Min.   : 21.00  
##  1st Qu.: 59.00   1st Qu.: 60.00  
##  Median : 69.00   Median : 70.00  
##  Mean   : 70.05   Mean   : 70.81  
##  3rd Qu.: 80.00   3rd Qu.: 80.00  
##  Max.   :313.00   Max.   :313.00  
##  NA's   :259

Resumo da variável que acabamos de predizer para a base de teste:

summary(ds2_all[ds2_all$table == "test" , c("LotFrontage", "LotFrontage_2")])
##   LotFrontage     LotFrontage_2   
##  Min.   : 21.00   Min.   : 21.00  
##  1st Qu.: 58.00   1st Qu.: 60.00  
##  Median : 67.00   Median : 68.00  
##  Mean   : 68.58   Mean   : 69.18  
##  3rd Qu.: 80.00   3rd Qu.: 80.00  
##  Max.   :200.00   Max.   :200.00  
##  NA's   :227

A variável ajustada parece estar um pouco mais inflada que a original, mas nada muito descrepante. Os valores mínimo e máximo continuam os mesmos em ambos subsets.

Concluímos então essa etapa e vamos olhar agora para o modelo principal.

Variável Ano de Construção da Garagem (GarageYrBlt)

Notamos que temos muitos valores faltantes nessa variável.

ds2_all_GYB_summary1 = ds2_all %>% 
  dplyr::select(Id, GarageYrBlt, GarageDummy_, table) %>% 
  group_by(table, GarageDummy_) %>% 
  #filter(raca=="Branca") %>% 
  summarise(
    n = n(),
    min = min(GarageYrBlt, na.rm = T),
    max = max(GarageYrBlt, na.rm = T),
    mean = mean(GarageYrBlt, na.rm = T),
    median = median(GarageYrBlt, na.rm = T),
    sd = sd(GarageYrBlt, na.rm = T),
    missing = sum(is.na(GarageYrBlt))
  ) 

ds2_all_GYB_summary1 %>% kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
table GarageDummy_ n min max mean median sd missing
test 0 76 Inf -Inf NaN NA NA 76
test 1 1383 1895 2207 1977.721 1979 26.43117 2
train 0 81 Inf -Inf NaN NA NA 81
train 1 1379 1900 2010 1978.506 1980 24.68972 0

Pela tablea acima é possível notar que todos os valores faltantes são de fato para propriedades que não possuem garagem, o que estaria correto.
Uma coisa que chama bastante atenção é o ano máximo da base de teste, mostrando o que possivelmente foi erro de digitação. Assim, vamos setar o ano de 2022, pois é quando estamos fazendo esse trabalho e qualquer ano acima disso vamos colocar como nulo.

ds2_all$GarageYrBlt_ <- ifelse(ds2_all$GarageYrBlt > 2022, NA, ds2_all$GarageYrBlt) 
ds2_all_GYB_summary2 = ds2_all %>% 
  dplyr::select(Id, GarageYrBlt_, GarageDummy_, table) %>% 
  group_by(table, GarageDummy_) %>% 
  #filter(raca=="Branca") %>% 
  summarise(
    n = n(),
    min = min(GarageYrBlt_, na.rm = T),
    max = max(GarageYrBlt_, na.rm = T),
    mean = mean(GarageYrBlt_, na.rm = T),
    median = median(GarageYrBlt_, na.rm = T),
    sd = sd(GarageYrBlt_, na.rm = T),
    missing = sum(is.na(GarageYrBlt_))
  ) 

ds2_all_GYB_summary2 %>% kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
table GarageDummy_ n min max mean median sd missing
test 0 76 Inf -Inf NaN NA NA 76
test 1 1383 1895 2010 1977.555 1979 25.70924 3
train 0 81 Inf -Inf NaN NA NA 81
train 1 1379 1900 2010 1978.506 1980 24.68972 0

Base de Dados Ajustada

Aqui vamos remover as colunas que necessitavam de ajustes e deixar somente as ajustadas.

# removing old variables
ds2_all_semi = ds2_all %>% dplyr::select(
Id,
table,
SalePrice,
SalePrice_log,
`1stFlrSF`,
`2ndFlrSF`,
`3SsnPorch`,
#Alley,
Alley_,
BedroomAbvGr,
BldgType,
BsmtDummy_,
#BsmtCond,
BsmtCond_,
#BsmtExposure,
BsmtExposure_,
BsmtFinSF1,
BsmtFinSF2,
#BsmtFinType1,
BsmtFinType1_,
#BsmtFinType2,
BsmtFinType2_,
BsmtFullBath,
BsmtHalfBath,
#BsmtQual,
BsmtQual_,
BsmtUnfSF,
#CentralAir,
CentralAir_,
Condition1,
Condition2,
Electrical,
EnclosedPorch,
#ExterCond,
ExterCond_,
Exterior1st,
Exterior2nd,
#ExterQual,
ExterQual_,
FenceDummy_,
#Fence,
Fence_,
FirePlaceDummy_,
#FireplaceQu,
FireplaceQu_,
Fireplaces,
Foundation,
FullBath,
Functional,
GarageArea,
GarageCars,
GarageDummy_,
#GarageCond,
GarageCond_,
#GarageFinish,
GarageFinish_,
#GarageQual,
GarageQual_,
#GarageType,
GarageType_,
#GarageYrBlt,
GarageYrBlt_,
GrLivArea,
HalfBath,
Heating,
#HeatingQC,
HeatingQC_,
HouseStyle,
KitchenAbvGr,
#KitchenQual,
KitchenQual_,
LandContour,
LandSlope,
LotArea,
#LotArea_log,
LotConfig,
#LotFrontage,
LotFrontage_2,
#LotFrontage_log,
#LotFrontage_log_2,
#LotFrontage_log_pred,
LotShape,
LowQualFinSF,
MasVnrDummy_,
MasVnrArea,
#MasVnrType,
MasVnrType_,
MiscFeatureDummy_,
#MiscFeature,
MiscFeature_,
MiscVal,
MoSold,
#MSSubClass,
MSSubClass_,
MSZoning,
Neighborhood,
OpenPorchSF,
#OverallCond,
OverallCond_,
#OverallQual,
OverallQual_,
PavedDrive,
PoolArea,
PoolDummy_,
#PoolQC,
PoolQC_,
RoofMatl,
RoofStyle,
SaleCondition,
SaleType,
ScreenPorch,
Street,
TotalBsmtSF,
TotRmsAbvGrd,
Utilities,
WoodDeckSF,
YearBuilt,
YearRemodAdd,
YrSold
)

Valores faltantes

Verificar a quantidade de valores faltantes na base final.

Train

h_train2_summary1 <- as.data.frame(t(ds2_all_semi[ds2_all_semi$table == "train",] %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))))
colnames(h_train2_summary1) <- c("Missing Observations")
h_train2_summary1$`-` <- ""

h_train2_summary1[h_train2_summary1$`Missing Observations` > 0, ] %>% 
  arrange(desc(`Missing Observations`)) %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Missing Observations
GarageYrBlt_ 81
MasVnrDummy_ 8
MasVnrArea 8
MasVnrType_ 8
BsmtExposure_ 1
BsmtFinType2_ 1
Electrical 1

Test

h_test2_summary1 <- as.data.frame(t(ds2_all_semi[ds2_all_semi$table == "test",] %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))))
colnames(h_test2_summary1) <- c("Missing Observations")
h_test2_summary1$`-` <- ""

h_test2_summary1[h_test2_summary1$`Missing Observations` > 0, ] %>% 
  arrange(desc(`Missing Observations`)) %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Missing Observations
GarageYrBlt_ 79
MasVnrType_ 16
MasVnrDummy_ 15
MasVnrArea 15
BsmtCond_ 4
MSZoning 4
BsmtExposure_ 3
BsmtQual_ 3
PoolQC_ 3
BsmtFullBath 2
BsmtHalfBath 2
Functional 2
GarageCond_ 2
GarageFinish_ 2
GarageQual_ 2
Utilities 2
BsmtDummy_ 1
BsmtFinSF1 1
BsmtFinSF2 1
BsmtFinType1_ 1
BsmtFinType2_ 1
BsmtUnfSF 1
Exterior1st 1
Exterior2nd 1
GarageArea 1
GarageCars 1
KitchenQual_ 1
SaleType 1
TotalBsmtSF 1

Vamos remover os valores faltantes de forma condicional:

ds2_all_semi = ds2_all_semi %>% mutate(
  remove_rows = case_when(
    GarageDummy_ == 1 & is.na(GarageYrBlt_) == TRUE ~ 1 ,
    GarageDummy_ == 1 & is.na(GarageArea) == TRUE ~ 1 ,
    GarageDummy_ == 1 & is.na(GarageCars) == TRUE ~ 1 ,
    GarageDummy_ == 1 & is.na(GarageCond_) == TRUE ~ 1 ,
    GarageDummy_ == 1 & is.na(GarageFinish_) == TRUE ~ 1 ,
    GarageDummy_ == 1 & is.na(GarageQual_) == TRUE ~ 1 ,
    PoolDummy_ == 1 & is.na(PoolQC_) == TRUE ~ 1 ,
    is.na(BsmtDummy_) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtCond_) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtQual_) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtExposure_) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtFinType1_) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtFinType2_) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtFullBath) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtHalfBath) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtFinSF1) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtFinSF2) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(BsmtUnfSF) == TRUE ~ 1 ,
    BsmtDummy_ == 1 & is.na(TotalBsmtSF) == TRUE ~ 1 ,
    is.na(Exterior1st) == TRUE ~ 1 ,
    is.na(Functional) == TRUE ~ 1 ,
    KitchenAbvGr > 0 & is.na(KitchenQual_) == TRUE ~ 1 ,
    is.na(MasVnrDummy_) == TRUE ~ 1 ,
    MasVnrDummy_ == 1 & is.na(MasVnrType_) == TRUE ~ 1 ,
    MasVnrDummy_ == 1 & is.na(MasVnrArea) == TRUE ~ 1 ,
    is.na(MSZoning) == TRUE ~ 1 ,
    is.na(SaleType) == TRUE ~ 1 ,
    is.na(Utilities) == TRUE ~ 1 ,
    is.na(Electrical) == TRUE ~ 1 ,
    
    TRUE ~ 0
  )
)
h_both2_summary1 <- as.data.frame(t(ds2_all_semi[ds2_all_semi$remove_rows == 0,] %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))))
colnames(h_both2_summary1) <- c("Missing Observations")
h_both2_summary1$`-` <- ""

h_both2_summary1[h_both2_summary1$`Missing Observations` > 0, ] %>% 
  arrange(desc(`Missing Observations`)) %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Missing Observations
GarageYrBlt_ 155
BsmtFullBath 1
BsmtHalfBath 1

Note que agora restaram somente os valores que são de fato valores reais. Para todos os casos, vamos deixar como está no momento.

ds2_all_semi[ds2_all_semi$remove_rows == 0 & is.na(ds2_all_semi$BsmtQual_),] %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)

ds2_all_semi[ds2_all_semi$remove_rows == 0 & ds2_all_semi$BsmtFinType1_ == "Unf",] %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)

ds2_all_semi[ds2_all_semi$remove_rows == 0 & is.na(ds2_all_semi$Electrical),] %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)

ds2_all_semi[ds2_all_semi$remove_rows == 0 & is.na(ds2_all_semi$BsmtHalfBath),] %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)

ds2_all_semi[is.na(ds2_all_semi$BsmtHalfBath),] %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)

Relação entre as variáveis

ggpairs(ds2_all_semi[ds2_all_semi$table == "train", c(3:4,5:20)], 
        #ggplot2::aes(colour = Transported_, alpha = 0.5),
        upper = list(continuous = wrap("cor", size=2, binwidth=0.5))
        ) + theme_bw()
ds2_all_semi_Bsmt = ds2_all_semi[ds2_all_semi$table == "train",] %>% 
  dplyr::select("SalePrice" | contains("Bsm")) 

ggpairs(ds2_all_semi_Bsmt,
        ggplot2::aes(colour = as.factor(BsmtDummy_), alpha = 0.5)
        ) + theme_bw()
ggpairs(ds2_all_semi[ds2_all_semi$table == "train", c(3:4,11:12)],
        ggplot2::aes(colour = as.factor(BsmtDummy_), alpha = 0.5)
        ) + theme_bw()

Pleo gráfico ascima podemos ver que os preços das casas são mais elevados quando a casa possui um porão. A correlação entre elas é fraca 0.2 .

ggpairs(ds2_all_semi[ds2_all_semi$table == "train", c(3:4,11,56,84)],
        ggplot2::aes(colour = as.factor(BsmtDummy_), alpha = 0.5)
        ) + theme_bw()

A correlação entre A área total do porão e o log do preço de venda é 0.61 . Nem mesmo a área do lote está tão fortemente correlacionada com o preço. Aparentemente, área construída agrega mais valor na venda.

Utilizando o mapa de calor que fizemos acima,

h_train1_cormat_1 <- select_if(ds2_all_semi[ds2_all_semi$table == "train", ], is.numeric) # Subset numeric columns (dplyr)
h_train1_cormat_2 <- round( x = cor(na.omit(h_train_cormat_1)), digits=2)
#head(h_train1_cormat_2)
h_train1_cormat_3 <- melt(h_train_cormat_2)  # heatmap (reshape2)
#head(h_train1_cormat_3)

h_train1_cormat_4 <- ggplot(data = h_train_cormat_3, 
            aes(x=Var1, y=Var2, fill = value)) +
  geom_tile() +
  theme(axis.text.x=element_text(color = "black", size=7, angle=30, vjust=.8, hjust=0.8),
        axis.text.y=element_text(color = "black", size=7, angle=0, vjust=.8, hjust=0.8)
        ) +
  ggtitle("Correlation Heatmap")

ggplotly(h_train1_cormat_4)
h_train1_cormat_5 <- ggplot(data = h_train1_cormat_3[h_train1_cormat_3$value > 0.5 | 
                                      h_train1_cormat_3$value < -0.5,], 
            aes(x=Var1, y=Var2, fill = value)) +
  geom_tile() +
  theme(axis.text.x=element_text(color = "black", size=7, angle=30, vjust=.8, hjust=0.8),
        axis.text.y=element_text(color = "black", size=7, angle=0, vjust=.8, hjust=0.8)
        ) +
  ggtitle("Correlation Heatmap")

ggplotly(h_train1_cormat_5)
h_train1_cormat_4$data[h_train1_cormat_4$data$Var1 == "SalePrice_log" ,] %>%
  arrange(desc(value)) %>%
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Var1 Var2 value
SalePrice_log SalePrice_log 1.00
SalePrice_log SalePrice 0.95
SalePrice_log GrLivArea 0.57
SalePrice_log OverallQual 0.56
SalePrice_log GarageCars 0.49
SalePrice_log TotRmsAbvGrd 0.47
SalePrice_log GarageArea 0.47
SalePrice_log FullBath 0.46
SalePrice_log TotalBsmtSF 0.45
SalePrice_log 1stFlrSF 0.45
SalePrice_log YearBuilt 0.41
SalePrice_log YearRemodAdd 0.41
SalePrice_log GarageYrBlt 0.38
SalePrice_log LotArea_log 0.37
SalePrice_log FirePlaceDummy_ 0.36
SalePrice_log Fireplaces 0.35
SalePrice_log MasVnrArea 0.34
SalePrice_log LotFrontage 0.33
SalePrice_log LotFrontage_log 0.32
SalePrice_log LotArea 0.31
SalePrice_log MasVnrDummy_ 0.29
SalePrice_log OpenPorchSF 0.27
SalePrice_log BsmtFinSF1 0.26
SalePrice_log 2ndFlrSF 0.26
SalePrice_log BedroomAbvGr 0.26
SalePrice_log WoodDeckSF 0.26
SalePrice_log HalfBath 0.24
SalePrice_log BsmtUnfSF 0.18
SalePrice_log BsmtFullBath 0.16
SalePrice_log BsmtDummy_ 0.10
SalePrice_log ScreenPorch 0.09
SalePrice_log MoSold 0.08
SalePrice_log Id 0.06
SalePrice_log PoolArea 0.06
SalePrice_log PoolDummy_ 0.06
SalePrice_log 3SsnPorch 0.02
SalePrice_log MiscVal 0.01
SalePrice_log LowQualFinSF 0.00
SalePrice_log BsmtFinSF2 -0.01
SalePrice_log BsmtHalfBath -0.02
SalePrice_log YrSold -0.04
SalePrice_log MiscFeatureDummy_ -0.06
SalePrice_log OverallCond -0.07
SalePrice_log MSSubClass -0.10
SalePrice_log KitchenAbvGr -0.10
SalePrice_log EnclosedPorch -0.10
SalePrice_log FenceDummy_ -0.14
SalePrice_log GarageDummy_ NA

Analisando o gráfico acima, temos que:

  • GrLivArea: cor = 0.57 )
  • OverallQual: cor = 0.56 )
    sendo essas as variáveis que possuem mais correlação com o preço de venda da casas.
ggpairs(ds2_all_semi[ds2_all_semi$table == "train", c(3:4,11,56,84, 76, 73, 47, 80, 88:90)],
        ggplot2::aes(colour = SaleCondition, alpha = 0.5),
        upper = list(continuous = wrap("cor", size=2, binwidth=0.5))
        ) + theme_bw()

Conseguimor ver que a variável Área total do porão parece ser linearmente relacionada com o log to preço de venda da casa.

Gráficos Gerais

SalePrice_log & OverallQual_
plot(x=ds2_all_semi[ds2_all_semi$table == "train", ]$OverallQual_, 
     y=ds2_all_semi[ds2_all_semi$table == "train", ]$SalePrice_log,
     xlab = "OverallQual_", ylab = "SalePrice_log")

SalePrice_log & GrLivArea
plot(x=ds2_all_semi[ds2_all_semi$table == "train", ]$GrLivArea, 
     y=ds2_all_semi[ds2_all_semi$table == "train", ]$SalePrice_log,
     xlab = "GrLivArea", ylab = "SalePrice_log")

SalePrice_log & GarageCars
plot(x=ds2_all_semi[ds2_all_semi$table == "train", ]$GarageCars, 
     y=ds2_all_semi[ds2_all_semi$table == "train", ]$SalePrice_log,
     xlab = "GarageCars", ylab = "SalePrice_log")

SalePrice_log & TotalBsmtSF
plot(x=ds2_all_semi[ds2_all_semi$table == "train", ]$TotalBsmtSF, 
     y=ds2_all_semi[ds2_all_semi$table == "train", ]$SalePrice_log,
     xlab = "TotalBsmtSF", ylab = "SalePrice_log")

SalePrice_log & 1stFlrSF
plot(x=ds2_all_semi[ds2_all_semi$table == "train", ]$`1stFlrSF`, 
     y=ds2_all_semi[ds2_all_semi$table == "train", ]$SalePrice_log,
     xlab = "1stFlrSF", ylab = "SalePrice_log")

SalePrice_log & YearBuilt
boxplot(ds2_all_semi[ds2_all_semi$table == "train", ]$SalePrice_log ~ 
          ds2_all_semi[ds2_all_semi$table == "train", ]$YearBuilt, 
     xlab = "YearBuilt", ylab = "SalePrice_log")

Modelagem [Modelling]

A variável resposta é do tipo quantitativa contínua, pois ão os valores em que as casas foram vendidas. Assim, podemos usar Regressão Linear, e como já foi mostrado acima, vamos usar o log do preço de venda ao invés da variável original.

ds2_all_train <- ds2_all_semi[ds2_all_semi$table == "train" & ds2_all_semi$remove_rows == 0, -c(2,91,46)]
ds2_all_test <- ds2_all_semi[ds2_all_semi$table == "test" & ds2_all_semi$remove_rows == 0, -c(2,91,46)]

Modelo 1

Modelo com todas as variáveis.

##Model 1
houses_model1 <- lm(SalePrice_log ~ . , 
                     data = ds2_all_train[,-c(1,2)])
summary(houses_model1)
## 
## Call:
## lm(formula = SalePrice_log ~ ., data = ds2_all_train[, -c(1, 
##     2)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68950 -0.04404  0.00093  0.04902  0.68823 
## 
## Coefficients: (18 not defined because of singularities)
##                          Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)           9.445272939  6.664463206   1.417             0.156673    
## `1stFlrSF`            0.000087859  0.000090567   0.970             0.332197    
## `2ndFlrSF`            0.000094861  0.000084729   1.120             0.263125    
## `3SsnPorch`           0.000151058  0.000101480   1.489             0.136875    
## Alley_Grvl            0.008003877  0.019423966   0.412             0.680369    
## Alley_Pave            0.032578630  0.022542689   1.445             0.148670    
## BedroomAbvGr          0.005403449  0.006463330   0.836             0.403316    
## BldgType2fmCon       -0.110650935  0.126969824  -0.871             0.383674    
## BldgTypeDuplex       -0.025038660  0.034407879  -0.728             0.466942    
## BldgTypeTwnhs        -0.010116974  0.070666494  -0.143             0.886184    
## BldgTypeTwnhsE        0.013549673  0.067210114   0.202             0.840263    
## BsmtDummy_           -3.040371047  6.062429219  -0.502             0.616106    
## BsmtCond_.L           3.075055785  4.300318121   0.715             0.474704    
## BsmtCond_.Q          -1.980372359  2.826051449  -0.701             0.483594    
## BsmtCond_.C           0.646667919  0.954842115   0.677             0.498380    
## BsmtCond_^4                    NA           NA      NA                   NA    
## BsmtExposure_.L       0.056815304  0.048244347   1.178             0.239172    
## BsmtExposure_.Q      -0.006540911  0.043595850  -0.150             0.880763    
## BsmtExposure_.C       0.021178930  0.026540327   0.798             0.425037    
## BsmtExposure_^4                NA           NA      NA                   NA    
## BsmtFinSF1            0.000147958  0.000024713   5.987  0.00000000283628147 ***
## BsmtFinSF2            0.000139854  0.000041414   3.377             0.000757 ***
## BsmtFinType1_.L      -0.146692672  0.187435394  -0.783             0.434002    
## BsmtFinType1_.Q       0.174672510  0.182923561   0.955             0.339828    
## BsmtFinType1_.C      -0.131003342  0.137533876  -0.953             0.341031    
## BsmtFinType1_^4       0.077718831  0.080984435   0.960             0.337416    
## BsmtFinType1_^5      -0.019730971  0.034979843  -0.564             0.572816    
## BsmtFinType1_^6                NA           NA      NA                   NA    
## BsmtFinType2_.L      -0.524605497  0.300899772  -1.743             0.081516 .  
## BsmtFinType2_.Q       0.540580802  0.292328232   1.849             0.064677 .  
## BsmtFinType2_.C      -0.373627820  0.221675314  -1.685             0.092163 .  
## BsmtFinType2_^4       0.203491799  0.128369319   1.585             0.113189    
## BsmtFinType2_^5      -0.140003210  0.066684052  -2.100             0.035986 *  
## BsmtFinType2_^6                NA           NA      NA                   NA    
## BsmtFullBath          0.025123416  0.009039814   2.779             0.005536 ** 
## BsmtHalfBath          0.008603663  0.013825206   0.622             0.533854    
## BsmtQual_.L          -0.040120177  0.069962710  -0.573             0.566449    
## BsmtQual_.Q           0.036478582  0.049150205   0.742             0.458123    
## BsmtQual_.C           0.002161504  0.023622498   0.092             0.927109    
## BsmtQual_^4                    NA           NA      NA                   NA    
## BsmtUnfSF             0.000092249  0.000022559   4.089  0.00004622374626898 ***
## CentralAir_Yes        0.054437351  0.017783930   3.061             0.002256 ** 
## Condition1Feedr       0.032967524  0.022945831   1.437             0.151054    
## Condition1Norm        0.086777602  0.019123272   4.538  0.00000626887814370 ***
## Condition1PosA        0.045699297  0.045591705   1.002             0.316377    
## Condition1PosN        0.093634887  0.034030407   2.752             0.006024 ** 
## Condition1RRAe       -0.031907091  0.041362164  -0.771             0.440621    
## Condition1RRAn        0.058638267  0.031535897   1.859             0.063219 .  
## Condition1RRNe        0.014450171  0.079142976   0.183             0.855157    
## Condition1RRNn        0.072838364  0.058505353   1.245             0.213385    
## Condition2Feedr       0.193409940  0.123161096   1.570             0.116596    
## Condition2Norm        0.169925200  0.111562937   1.523             0.127995    
## Condition2PosA        0.361693476  0.181897986   1.988             0.046995 *  
## Condition2PosN       -0.677951784  0.141949224  -4.776  0.00000201386966487 ***
## Condition2RRAe       -0.704800999  0.334725005  -2.106             0.035450 *  
## Condition2RRAn        0.077538819  0.157213646   0.493             0.621959    
## Condition2RRNn        0.164572750  0.139323833   1.181             0.237753    
## ElectricalFuseF      -0.004792882  0.027060393  -0.177             0.859446    
## ElectricalFuseP      -0.102365794  0.085522751  -1.197             0.231571    
## ElectricalMix         0.902287332  1.482118925   0.609             0.542787    
## ElectricalSBrkr      -0.031429039  0.013818263  -2.274             0.023118 *  
## EnclosedPorch         0.000115499  0.000057672   2.003             0.045439 *  
## ExterCond_.L          0.136190133  0.106661783   1.277             0.201911    
## ExterCond_.Q         -0.052352105  0.088372904  -0.592             0.553697    
## ExterCond_.C          0.054460595  0.055003908   0.990             0.322319    
## ExterCond_^4          0.009275298  0.023426937   0.396             0.692233    
## Exterior1stAsphShn    0.018716989  0.150125882   0.125             0.900802    
## Exterior1stBrkComm   -0.266367659  0.132088764  -2.017             0.043967 *  
## Exterior1stBrkFace    0.091455184  0.059930173   1.526             0.127272    
## Exterior1stCBlock    -0.153586692  0.127971982  -1.200             0.230320    
## Exterior1stCemntBd   -0.071167407  0.089182524  -0.798             0.425034    
## Exterior1stHdBoard   -0.000652808  0.061136362  -0.011             0.991482    
## Exterior1stImStucc    0.020093421  0.129065749   0.156             0.876309    
## Exterior1stMetalSd    0.050374341  0.069079363   0.729             0.466010    
## Exterior1stPlywood    0.005307285  0.060030636   0.088             0.929566    
## Exterior1stStone      0.037803432  0.113015094   0.334             0.738063    
## Exterior1stStucco     0.032833624  0.067603922   0.486             0.627287    
## Exterior1stVinylSd    0.010507384  0.063113781   0.166             0.867805    
## Exterior1stWd Sdng   -0.024276280  0.058729686  -0.413             0.679421    
## Exterior1stWdShing    0.023096146  0.063000914   0.367             0.713983    
## Exterior2ndAsphShn    0.045066083  0.102013774   0.442             0.658741    
## Exterior2ndBrk Cmn    0.098241134  0.094430805   1.040             0.298391    
## Exterior2ndBrkFace   -0.032134623  0.061598307  -0.522             0.601991    
## Exterior2ndCBlock              NA           NA      NA                   NA    
## Exterior2ndCmentBd    0.125866513  0.086988217   1.447             0.148182    
## Exterior2ndHdBoard    0.013748235  0.058346421   0.236             0.813760    
## Exterior2ndImStucc    0.017637653  0.067164300   0.263             0.792901    
## Exterior2ndMetalSd   -0.008829215  0.066880939  -0.132             0.894996    
## Exterior2ndOther     -0.085158811  0.123464627  -0.690             0.490493    
## Exterior2ndPlywood    0.010148482  0.056634515   0.179             0.857817    
## Exterior2ndStone      0.007602025  0.084643774   0.090             0.928452    
## Exterior2ndStucco     0.000274382  0.064390612   0.004             0.996601    
## Exterior2ndVinylSd    0.024623306  0.060332806   0.408             0.683257    
## Exterior2ndWd Sdng    0.043697237  0.056268949   0.777             0.437564    
## Exterior2ndWd Shng    0.004687541  0.058696184   0.080             0.936361    
## ExterQual_.L         -0.044033860  0.040318987  -1.092             0.274997    
## ExterQual_.Q          0.035489287  0.029123228   1.219             0.223245    
## ExterQual_.C         -0.013677312  0.014455511  -0.946             0.344259    
## FenceDummy_          -0.022463836  0.033938958  -0.662             0.508172    
## Fence_GdPrv           0.016196968  0.037440199   0.433             0.665378    
## Fence_GdWo           -0.019873353  0.036589366  -0.543             0.587133    
## Fence_MnPrv           0.012154413  0.034694346   0.350             0.726155    
## Fence_MnWw                     NA           NA      NA                   NA    
## FirePlaceDummy_       0.311834458  0.249548333   1.250             0.211697    
## FireplaceQu_.L       -0.191958730  0.152138729  -1.262             0.207295    
## FireplaceQu_.Q        0.157890400  0.142558346   1.108             0.268284    
## FireplaceQu_.C       -0.123362767  0.091395109  -1.350             0.177349    
## FireplaceQu_^4        0.036996165  0.037463065   0.988             0.323583    
## FireplaceQu_^5                 NA           NA      NA                   NA    
## Fireplaces            0.017349622  0.011752131   1.476             0.140133    
## FoundationCBlock      0.006393041  0.014815407   0.432             0.666175    
## FoundationPConc       0.029196755  0.015853318   1.842             0.065774 .  
## FoundationSlab       -0.001279006  0.046184922  -0.028             0.977912    
## FoundationStone       0.086460472  0.052208032   1.656             0.097974 .  
## FoundationWood       -0.122249736  0.067336280  -1.816             0.069701 .  
## FullBath              0.027593779  0.010177670   2.711             0.006802 ** 
## FunctionalMaj2       -0.277554753  0.067536424  -4.110  0.00004237132855035 ***
## FunctionalMin1       -0.011446829  0.040655501  -0.282             0.778333    
## FunctionalMin2       -0.043187008  0.041485228  -1.041             0.298080    
## FunctionalMod        -0.107010152  0.049576353  -2.158             0.031092 *  
## FunctionalSev        -0.282356071  0.135553951  -2.083             0.037469 *  
## FunctionalTyp         0.004708831  0.036255061   0.130             0.896683    
## GarageArea            0.000117805  0.000035215   3.345             0.000848 ***
## GarageCars            0.017250344  0.010478907   1.646             0.099991 .  
## GarageDummy_          1.171128298  0.593479097   1.973             0.048693 *  
## GarageCond_.L        -0.397328697  0.287290820  -1.383             0.166922    
## GarageCond_.Q         0.028754417  0.300990481   0.096             0.923908    
## GarageCond_.C        -0.201930863  0.135627848  -1.489             0.136793    
## GarageCond_^4        -0.039312517  0.099366410  -0.396             0.692449    
## GarageCond_^5                  NA           NA      NA                   NA    
## GarageFinish_Fin      0.009559999  0.011081134   0.863             0.388464    
## GarageFinish_RFn      0.014324603  0.009901745   1.447             0.148255    
## GarageFinish_Unf               NA           NA      NA                   NA    
## GarageQual_.L        -0.270777481  0.414111788  -0.654             0.513320    
## GarageQual_.Q         0.644670220  0.366129054   1.761             0.078537 .  
## GarageQual_.C        -0.184169018  0.174579040  -1.055             0.291674    
## GarageQual_^4         0.188069595  0.092881468   2.025             0.043111 *  
## GarageQual_^5                  NA           NA      NA                   NA    
## GarageType_2Types    -0.145818373  0.051737647  -2.818             0.004907 ** 
## GarageType_Attchd    -0.000763719  0.010753243  -0.071             0.943392    
## GarageType_Basment   -0.008190734  0.031960141  -0.256             0.797780    
## GarageType_BuiltIn   -0.010263696  0.018543978  -0.553             0.580041    
## GarageType_CarPort    0.015550221  0.046173055   0.337             0.736342    
## GarageType_Detchd              NA           NA      NA                   NA    
## GrLivArea             0.000131977  0.000088654   1.489             0.136843    
## HalfBath              0.027607277  0.009610500   2.873             0.004144 ** 
## HeatingGasA           0.144812212  0.118592803   1.221             0.222298    
## HeatingGasW           0.199141597  0.122221471   1.629             0.103507    
## HeatingGrav          -0.067148788  0.129536226  -0.518             0.604292    
## HeatingOthW           0.131663009  0.145315838   0.906             0.365097    
## HeatingWall           0.167036994  0.136343391   1.225             0.220776    
## HeatingQC_.L          0.069012793  0.076857065   0.898             0.369403    
## HeatingQC_.Q         -0.034893542  0.065082329  -0.536             0.591960    
## HeatingQC_.C          0.037422916  0.040668653   0.920             0.357662    
## HeatingQC_^4         -0.018772311  0.018265275  -1.028             0.304275    
## HouseStyle1.5Unf      0.206833868  0.113685740   1.819             0.069113 .  
## HouseStyle1Story     -0.015592170  0.040580059  -0.384             0.700876    
## HouseStyle2.5Fin     -0.031640765  0.082664088  -0.383             0.701965    
## HouseStyle2.5Unf      0.081868602  0.077767081   1.053             0.292676    
## HouseStyle2Story     -0.002066528  0.037567868  -0.055             0.956142    
## HouseStyleSFoyer     -0.010050362  0.053986994  -0.186             0.852349    
## HouseStyleSLvl        0.050061827  0.063194670   0.792             0.428413    
## KitchenAbvGr         -0.053103714  0.028456272  -1.866             0.062270 .  
## KitchenQual_.L        0.030003504  0.020134990   1.490             0.136462    
## KitchenQual_.Q        0.033262350  0.014236234   2.336             0.019635 *  
## KitchenQual_.C        0.009922890  0.007850449   1.264             0.206485    
## LandContourHLS        0.028435438  0.023718108   1.199             0.230812    
## LandContourLow        0.004249393  0.029920106   0.142             0.887085    
## LandContourLvl        0.030436812  0.017117308   1.778             0.075641 .  
## LandSlopeMod          0.035245072  0.018461242   1.909             0.056488 .  
## LandSlopeSev         -0.207249759  0.052037245  -3.983  0.00007232395635704 ***
## LotArea               0.000002613  0.000000532   4.912  0.00000103128690908 ***
## LotConfigCulDSac      0.033090785  0.014998530   2.206             0.027558 *  
## LotConfigFR2         -0.030903749  0.018239898  -1.694             0.090475 .  
## LotConfigFR3         -0.076328660  0.057088072  -1.337             0.181471    
## LotConfigInside      -0.015786028  0.008230917  -1.918             0.055367 .  
## LotFrontage_2         0.000416990  0.000203867   2.045             0.041038 *  
## LotShapeIR2           0.022767179  0.019182277   1.187             0.235513    
## LotShapeIR3           0.026080503  0.040083475   0.651             0.515397    
## LotShapeReg           0.010276264  0.007356104   1.397             0.162686    
## LowQualFinSF                   NA           NA      NA                   NA    
## MasVnrDummy_          0.008740393  0.054428025   0.161             0.872447    
## MasVnrArea            0.000018014  0.000026693   0.675             0.499891    
## MasVnrType_BrkCmn    -0.050651434  0.061534448  -0.823             0.410596    
## MasVnrType_BrkFace    0.003040660  0.053954994   0.056             0.955068    
## MasVnrType_Stone      0.010707835  0.055362722   0.193             0.846670    
## MiscFeatureDummy_     0.025162395  0.216947184   0.116             0.907685    
## MiscFeature_Gar2      0.475576533  0.452685704   1.051             0.293674    
## MiscFeature_Othr     -0.048931583  0.233331931  -0.210             0.833932    
## MiscFeature_Shed     -0.008174406  0.213546133  -0.038             0.969471    
## MiscFeature_TenC               NA           NA      NA                   NA    
## MiscVal              -0.000028703  0.000028913  -0.993             0.321041    
## MoSold               -0.000841969  0.001133129  -0.743             0.457602    
## MSSubClass_30        -0.066385166  0.021739803  -3.054             0.002312 ** 
## MSSubClass_40        -0.066457794  0.080257658  -0.828             0.407808    
## MSSubClass_45        -0.222128184  0.115279404  -1.927             0.054237 .  
## MSSubClass_50        -0.005352633  0.039925361  -0.134             0.893373    
## MSSubClass_60        -0.034456391  0.035227130  -0.978             0.328216    
## MSSubClass_70         0.012526961  0.038333617   0.327             0.743887    
## MSSubClass_75        -0.054579869  0.075152636  -0.726             0.467828    
## MSSubClass_80        -0.070101787  0.057809928  -1.213             0.225517    
## MSSubClass_85        -0.016866191  0.048623780  -0.347             0.728750    
## MSSubClass_90                  NA           NA      NA                   NA    
## MSSubClass_120       -0.048999617  0.066440675  -0.737             0.460969    
## MSSubClass_160       -0.152536948  0.079986058  -1.907             0.056759 .  
## MSSubClass_180       -0.074324722  0.089688845  -0.829             0.407445    
## MSSubClass_190        0.082093056  0.129448085   0.634             0.526089    
## MSZoningFV            0.507257436  0.056715436   8.944 < 0.0000000000000002 ***
## MSZoningRH            0.459405375  0.056449639   8.138  0.00000000000000101 ***
## MSZoningRL            0.469954925  0.048748364   9.640 < 0.0000000000000002 ***
## MSZoningRM            0.419635019  0.045840089   9.154 < 0.0000000000000002 ***
## NeighborhoodBlueste   0.042994824  0.091286257   0.471             0.637736    
## NeighborhoodBrDale   -0.000690402  0.053836997  -0.013             0.989770    
## NeighborhoodBrkSide   0.025410867  0.044086222   0.576             0.564462    
## NeighborhoodClearCr   0.028648590  0.042712467   0.671             0.502524    
## NeighborhoodCollgCr  -0.013920515  0.033435039  -0.416             0.677234    
## NeighborhoodCrawfor   0.115649202  0.039695386   2.913             0.003643 ** 
## NeighborhoodEdwards  -0.076832370  0.037119160  -2.070             0.038682 *  
## NeighborhoodGilbert  -0.005989834  0.035376559  -0.169             0.865577    
## NeighborhoodIDOTRR   -0.007161678  0.049931090  -0.143             0.885974    
## NeighborhoodMeadowV  -0.141506636  0.056385388  -2.510             0.012220 *  
## NeighborhoodMitchel  -0.049640680  0.037697474  -1.317             0.188157    
## NeighborhoodNAmes    -0.036925489  0.036181459  -1.021             0.307672    
## NeighborhoodNoRidge   0.037503291  0.038988250   0.962             0.336292    
## NeighborhoodNPkVill   0.000191903  0.063798006   0.003             0.997600    
## NeighborhoodNridgHt   0.076817346  0.034624316   2.219             0.026705 *  
## NeighborhoodNWAmes   -0.036169347  0.036953587  -0.979             0.327892    
## NeighborhoodOldTown  -0.028726008  0.044652342  -0.643             0.520138    
## NeighborhoodSawyer   -0.026306301  0.037380125  -0.704             0.481728    
## NeighborhoodSawyerW   0.001245753  0.036037458   0.035             0.972430    
## NeighborhoodSomerst   0.026424328  0.041577607   0.636             0.525199    
## NeighborhoodStoneBr   0.130732928  0.038541185   3.392             0.000717 ***
## NeighborhoodSWISU     0.012978591  0.044932308   0.289             0.772749    
## NeighborhoodTimber    0.014362237  0.037694190   0.381             0.703257    
## NeighborhoodVeenker   0.049618005  0.048097839   1.032             0.302470    
## OpenPorchSF           0.000040980  0.000053525   0.766             0.444057    
## OverallCond_.L       -0.693702489  0.781309444  -0.888             0.374792    
## OverallCond_.Q        0.955873073  0.795563817   1.202             0.229798    
## OverallCond_.C       -0.750575508  0.658791405  -1.139             0.254802    
## OverallCond_^4        0.469494467  0.456123815   1.029             0.303545    
## OverallCond_^5       -0.205417721  0.262979452  -0.781             0.434891    
## OverallCond_^6        0.072118655  0.117923696   0.612             0.540940    
## OverallCond_^7       -0.003595957  0.037932501  -0.095             0.924491    
## OverallCond_^8                 NA           NA      NA                   NA    
## OverallQual_.L        0.623062772  0.083879178   7.428  0.00000000000021146 ***
## OverallQual_.Q       -0.267740145  0.075131373  -3.564             0.000380 ***
## OverallQual_.C        0.216330376  0.062535500   3.459             0.000561 ***
## OverallQual_^4       -0.162365647  0.053538655  -3.033             0.002477 ** 
## OverallQual_^5        0.042387295  0.046642341   0.909             0.363657    
## OverallQual_^6       -0.038000300  0.037702798  -1.008             0.313715    
## OverallQual_^7        0.000664531  0.027296199   0.024             0.980581    
## OverallQual_^8        0.000274743  0.017189173   0.016             0.987250    
## OverallQual_^9       -0.004159637  0.009087225  -0.458             0.647220    
## PavedDriveP          -0.017682013  0.025572935  -0.691             0.489429    
## PavedDriveY           0.011565676  0.016001476   0.723             0.469954    
## PoolArea              0.001433329  0.001047376   1.368             0.171419    
## PoolDummy_           -1.251618726  0.730576714  -1.713             0.086941 .  
## PoolQC_.L             0.417470059  0.313550053   1.331             0.183306    
## PoolQC_.Q            -0.204436262  0.248829036  -0.822             0.411475    
## PoolQC_.C                      NA           NA      NA                   NA    
## RoofMatlCompShg       2.554739903  0.244189655  10.462 < 0.0000000000000002 ***
## RoofMatlMembran       2.999456976  0.289234765  10.370 < 0.0000000000000002 ***
## RoofMatlMetal         2.837026139  0.286760810   9.893 < 0.0000000000000002 ***
## RoofMatlRoll          2.555356893  0.271146428   9.424 < 0.0000000000000002 ***
## RoofMatlTar&Grv       2.571263567  0.261107554   9.848 < 0.0000000000000002 ***
## RoofMatlWdShake       2.502067474  0.255537071   9.791 < 0.0000000000000002 ***
## RoofMatlWdShngl       2.616999481  0.248367475  10.537 < 0.0000000000000002 ***
## RoofStyleGable        0.021500243  0.083911505   0.256             0.797822    
## RoofStyleGambrel      0.004287777  0.092064495   0.047             0.962861    
## RoofStyleHip          0.020691550  0.084162719   0.246             0.805840    
## RoofStyleMansard      0.070973087  0.097541404   0.728             0.466991    
## RoofStyleShed         0.513396114  0.174309907   2.945             0.003290 ** 
## SaleConditionAdjLand  0.113591167  0.066655549   1.704             0.088618 .  
## SaleConditionAlloca   0.059815217  0.043150968   1.386             0.165954    
## SaleConditionFamily   0.017831900  0.027774990   0.642             0.520990    
## SaleConditionNormal   0.060991587  0.013253675   4.602  0.00000464269700382 ***
## SaleConditionPartial  0.031257290  0.067681482   0.462             0.644289    
## SaleTypeCon           0.090469010  0.079521495   1.138             0.255492    
## SaleTypeConLD         0.104755653  0.045770712   2.289             0.022274 *  
## SaleTypeConLI        -0.025852769  0.052678319  -0.491             0.623683    
## SaleTypeConLw         0.017920986  0.056716579   0.316             0.752078    
## SaleTypeCWD           0.045030248  0.058767353   0.766             0.443684    
## SaleTypeNew           0.057201304  0.070579539   0.810             0.417845    
## SaleTypeOth           0.060390376  0.066378410   0.910             0.363120    
## SaleTypeWD           -0.016541858  0.019174313  -0.863             0.388474    
## ScreenPorch           0.000287243  0.000057067   5.033  0.00000055715083132 ***
## StreetPave            0.110068686  0.058747765   1.874             0.061237 .  
## TotalBsmtSF                    NA           NA      NA                   NA    
## TotRmsAbvGrd          0.001170065  0.004403557   0.266             0.790510    
## UtilitiesNoSeWa      -0.298669260  0.129475558  -2.307             0.021242 *  
## WoodDeckSF            0.000088345  0.000026826   3.293             0.001020 ** 
## YearBuilt             0.001704179  0.000380058   4.484  0.00000804415592358 ***
## YearRemodAdd          0.000892482  0.000255912   3.487             0.000506 ***
## YrSold               -0.002734777  0.002348348  -1.165             0.244436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.102 on 1173 degrees of freedom
## Multiple R-squared:  0.9472, Adjusted R-squared:  0.9348 
## F-statistic: 76.54 on 275 and 1173 DF,  p-value: < 0.00000000000000022

Esse modelo nos dá um R² de 93.5 %. O que normalmente pode significar que está mais ajustado do que o necessário (overfitting).

autoplot(houses_model1)

Modelo 2

Vamos aplicar o stepwise.

houses_model1_step <- houses_model1 %>% 
  stepAIC(direction = "both", trace = FALSE)

Abaixo temos a ANOVA do stepwise, mostrando o modelo inicial e o final.

houses_model1_step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## SalePrice_log ~ `1stFlrSF` + `2ndFlrSF` + `3SsnPorch` + Alley_ + 
##     BedroomAbvGr + BldgType + BsmtDummy_ + BsmtCond_ + BsmtExposure_ + 
##     BsmtFinSF1 + BsmtFinSF2 + BsmtFinType1_ + BsmtFinType2_ + 
##     BsmtFullBath + BsmtHalfBath + BsmtQual_ + BsmtUnfSF + CentralAir_ + 
##     Condition1 + Condition2 + Electrical + EnclosedPorch + ExterCond_ + 
##     Exterior1st + Exterior2nd + ExterQual_ + FenceDummy_ + Fence_ + 
##     FirePlaceDummy_ + FireplaceQu_ + Fireplaces + Foundation + 
##     FullBath + Functional + GarageArea + GarageCars + GarageDummy_ + 
##     GarageCond_ + GarageFinish_ + GarageQual_ + GarageType_ + 
##     GrLivArea + HalfBath + Heating + HeatingQC_ + HouseStyle + 
##     KitchenAbvGr + KitchenQual_ + LandContour + LandSlope + LotArea + 
##     LotConfig + LotFrontage_2 + LotShape + LowQualFinSF + MasVnrDummy_ + 
##     MasVnrArea + MasVnrType_ + MiscFeatureDummy_ + MiscFeature_ + 
##     MiscVal + MoSold + MSSubClass_ + MSZoning + Neighborhood + 
##     OpenPorchSF + OverallCond_ + OverallQual_ + PavedDrive + 
##     PoolArea + PoolDummy_ + PoolQC_ + RoofMatl + RoofStyle + 
##     SaleCondition + SaleType + ScreenPorch + Street + TotalBsmtSF + 
##     TotRmsAbvGrd + Utilities + WoodDeckSF + YearBuilt + YearRemodAdd + 
##     YrSold
## 
## Final Model:
## SalePrice_log ~ `3SsnPorch` + BsmtExposure_ + BsmtFinSF1 + BsmtFinSF2 + 
##     BsmtFullBath + BsmtUnfSF + CentralAir_ + Condition1 + Condition2 + 
##     Electrical + EnclosedPorch + Exterior1st + Fireplaces + Foundation + 
##     FullBath + Functional + GarageArea + GarageCars + GarageCond_ + 
##     GarageQual_ + GarageType_ + GrLivArea + HalfBath + Heating + 
##     HeatingQC_ + KitchenAbvGr + KitchenQual_ + LandSlope + LotArea + 
##     LotConfig + LotFrontage_2 + MasVnrType_ + MSSubClass_ + MSZoning + 
##     Neighborhood + OverallCond_ + OverallQual_ + PoolArea + PoolQC_ + 
##     RoofMatl + RoofStyle + SaleCondition + SaleType + ScreenPorch + 
##     Street + Utilities + WoodDeckSF + YearBuilt + YearRemodAdd + 
##     FenceDummy_ + LowQualFinSF + ExterCond_
## 
## 
##                   Step Df                Deviance Resid. Df Resid. Dev
## 1                                                      1173   12.20025
## 2        - TotalBsmtSF  0 0.000000000000000000000      1173   12.20025
## 3         - PoolDummy_  0 0.000000000000010658141      1173   12.20025
## 4  - MiscFeatureDummy_  0 0.000000000000007105427      1173   12.20025
## 5       - LowQualFinSF  0 0.000000000000000000000      1173   12.20025
## 6       - GarageDummy_  0 0.000000000000003552714      1173   12.20025
## 7    - FirePlaceDummy_  0 0.000000000000007105427      1173   12.20025
## 8        - FenceDummy_  0 0.000000000000000000000      1173   12.20025
## 9         - BsmtDummy_  0 0.000000000000021316282      1173   12.20025
## 10       - Exterior2nd 14 0.139208557697136470210      1187   12.33946
## 11      - FireplaceQu_  5 0.035569769176275300993      1192   12.37503
## 12      - MiscFeature_  4 0.021123364712414272049      1196   12.39616
## 13        - ExterQual_  3 0.007775983259554308802      1199   12.40393
## 14          - BldgType  3 0.011233859814323565729      1202   12.41517
## 15        - HouseStyle  7 0.094423567560646404218      1209   12.50959
## 16         - BsmtQual_  3 0.025286597369474250740      1212   12.53488
## 17        - PavedDrive  2 0.014050766404931636089      1214   12.54893
## 18          - LotShape  3 0.032901078447157061646      1217   12.58183
## 19           - MiscVal  1 0.000392805301846266275      1218   12.58222
## 20      - MasVnrDummy_  1 0.000543791402053628303      1219   12.58276
## 21     - GarageFinish_  2 0.019805218824139103617      1221   12.60257
## 22     - BsmtFinType1_  5 0.069808384233837728061      1226   12.67238
## 23     - BsmtFinType2_  5 0.059448110061602221776      1231   12.73183
## 24            - Alley_  2 0.017042286764628045148      1233   12.74887
## 25      - TotRmsAbvGrd  1 0.003042152155099842048      1234   12.75191
## 26       - LandContour  3 0.040491804005055698212      1237   12.79240
## 27      - BedroomAbvGr  1 0.005958859604271893318      1238   12.79836
## 28        - MasVnrArea  1 0.006463273712499884027      1239   12.80482
## 29         - BsmtCond_  2 0.026785073492627375913      1241   12.83161
## 30            - Fence_  4 0.059589501244804665703      1245   12.89120
## 31       + FenceDummy_  1 0.028468327195056986056      1244   12.86273
## 32            - YrSold  1 0.008930977115221949703      1245   12.87166
## 33            - MoSold  1 0.015995090802931599683      1246   12.88766
## 34       - OpenPorchSF  1 0.015730955107287059036      1247   12.90339
## 35        - ExterCond_  4 0.070158192292607779450      1251   12.97355
## 36        - `1stFlrSF`  1 0.016142182511673652812      1252   12.98969
## 37        - `2ndFlrSF`  1 0.009009738751784013289      1253   12.99870
## 38      + LowQualFinSF  1 0.022328129377410022016      1252   12.97637
## 39      - BsmtHalfBath  1 0.016684227979824939325      1253   12.99305
## 40        + ExterCond_  4 0.071605291085855782285      1249   12.92145
##          AIC
## 1  -6370.122
## 2  -6370.122
## 3  -6370.122
## 4  -6370.122
## 5  -6370.122
## 6  -6370.122
## 7  -6370.122
## 8  -6370.122
## 9  -6370.122
## 10 -6381.683
## 11 -6387.512
## 12 -6393.040
## 13 -6398.132
## 14 -6402.820
## 15 -6405.841
## 16 -6408.915
## 17 -6411.292
## 18 -6413.498
## 19 -6415.453
## 20 -6417.390
## 21 -6419.111
## 22 -6421.107
## 23 -6424.325
## 24 -6426.387
## 25 -6428.041
## 26 -6429.448
## 27 -6430.773
## 28 -6432.041
## 29 -6433.013
## 30 -6434.300
## 31 -6435.503
## 32 -6436.498
## 33 -6436.698
## 34 -6436.930
## 35 -6437.073
## 36 -6437.272
## 37 -6438.267
## 38 -6438.758
## 39 -6438.896
## 40 -6438.904
summary(houses_model1_step)
## 
## Call:
## lm(formula = SalePrice_log ~ `3SsnPorch` + BsmtExposure_ + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtFullBath + BsmtUnfSF + CentralAir_ + Condition1 + 
##     Condition2 + Electrical + EnclosedPorch + Exterior1st + Fireplaces + 
##     Foundation + FullBath + Functional + GarageArea + GarageCars + 
##     GarageCond_ + GarageQual_ + GarageType_ + GrLivArea + HalfBath + 
##     Heating + HeatingQC_ + KitchenAbvGr + KitchenQual_ + LandSlope + 
##     LotArea + LotConfig + LotFrontage_2 + MasVnrType_ + MSSubClass_ + 
##     MSZoning + Neighborhood + OverallCond_ + OverallQual_ + PoolArea + 
##     PoolQC_ + RoofMatl + RoofStyle + SaleCondition + SaleType + 
##     ScreenPorch + Street + Utilities + WoodDeckSF + YearBuilt + 
##     YearRemodAdd + FenceDummy_ + LowQualFinSF + ExterCond_, data = ds2_all_train[, 
##     -c(1, 2)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71481 -0.04430  0.00000  0.05169  0.71481 
## 
## Coefficients: (2 not defined because of singularities)
##                           Estimate    Std. Error t value             Pr(>|t|)
## (Intercept)           1.5939901752  0.8505527102   1.874             0.061155
## `3SsnPorch`           0.0001702361  0.0000986421   1.726             0.084631
## BsmtExposure_.L       0.0360571008  0.0262087980   1.376             0.169142
## BsmtExposure_.Q       0.0178964773  0.0213813806   0.837             0.402746
## BsmtExposure_.C       0.0088692989  0.0133846662   0.663             0.507679
## BsmtExposure_^4       0.0055779590  0.0096623689   0.577             0.563850
## BsmtFinSF1            0.0001550841  0.0000186488   8.316 0.000000000000000235
## BsmtFinSF2            0.0001287205  0.0000245961   5.233 0.000000195073367119
## BsmtFullBath          0.0259546275  0.0079501827   3.265             0.001126
## BsmtUnfSF             0.0000873422  0.0000171695   5.087 0.000000419266057225
## CentralAir_Yes        0.0562325229  0.0168047792   3.346             0.000844
## Condition1Feedr       0.0401390992  0.0220354363   1.822             0.068759
## Condition1Norm        0.0857642955  0.0184264541   4.654 0.000003595519072802
## Condition1PosA        0.0515352869  0.0439986750   1.171             0.241705
## Condition1PosN        0.0856761768  0.0325265509   2.634             0.008542
## Condition1RRAe       -0.0294034047  0.0404187177  -0.727             0.467074
## Condition1RRAn        0.0567171792  0.0302220454   1.877             0.060795
## Condition1RRNe        0.0104477439  0.0778754655   0.134             0.893298
## Condition1RRNn        0.0882010720  0.0552840961   1.595             0.110872
## Condition2Feedr       0.0663088473  0.1032794893   0.642             0.520970
## Condition2Norm        0.0541635870  0.0886183470   0.611             0.541178
## Condition2PosA        0.2793525829  0.1632918100   1.711             0.087374
## Condition2PosN       -0.8084067809  0.1239798912  -6.520 0.000000000101508786
## Condition2RRAe       -0.5192829271  0.2039285133  -2.546             0.011003
## Condition2RRAn       -0.0595534228  0.1399725127  -0.425             0.670571
## Condition2RRNn        0.0147132963  0.1184915689   0.124             0.901199
## ElectricalFuseF       0.0069076109  0.0255303731   0.271             0.786771
## ElectricalFuseP      -0.0962344868  0.0768907670  -1.252             0.210959
## ElectricalMix        -0.2178673962  0.1677568638  -1.299             0.194283
## ElectricalSBrkr      -0.0296079760  0.0130852418  -2.263             0.023825
## EnclosedPorch         0.0001211373  0.0000550646   2.200             0.027996
## Exterior1stAsphShn    0.0594209412  0.1135780928   0.523             0.600947
## Exterior1stBrkComm   -0.1724913735  0.0921228781  -1.872             0.061384
## Exterior1stBrkFace    0.0812902414  0.0335430626   2.423             0.015515
## Exterior1stCBlock    -0.0909096805  0.1106933886  -0.821             0.411647
## Exterior1stCemntBd    0.0433312336  0.0354469219   1.222             0.221777
## Exterior1stHdBoard    0.0054102264  0.0311657088   0.174             0.862212
## Exterior1stImStucc    0.0205899743  0.1093050342   0.188             0.850616
## Exterior1stMetalSd    0.0363806981  0.0304762157   1.194             0.232806
## Exterior1stPlywood    0.0079312377  0.0325583744   0.244             0.807580
## Exterior1stStone      0.0387793983  0.0886249181   0.438             0.661775
## Exterior1stStucco     0.0396440683  0.0382939444   1.035             0.300749
## Exterior1stVinylSd    0.0282002659  0.0306399156   0.920             0.357554
## Exterior1stWd Sdng    0.0077095913  0.0302968609   0.254             0.799176
## Exterior1stWdShing    0.0231456227  0.0367689430   0.629             0.529144
## Fireplaces            0.0251342455  0.0058429127   4.302 0.000018273836929471
## FoundationCBlock     -0.0003759814  0.0140265191  -0.027             0.978620
## FoundationPConc       0.0265828095  0.0153474844   1.732             0.083509
## FoundationSlab       -0.0006266788  0.0438561618  -0.014             0.988601
## FoundationStone       0.0882249823  0.0487181139   1.811             0.070392
## FoundationWood       -0.1480473318  0.0653095390  -2.267             0.023569
## FullBath              0.0297405779  0.0095691024   3.108             0.001926
## FunctionalMaj2       -0.2614801931  0.0641778918  -4.074 0.000049062098673843
## FunctionalMin1       -0.0022479944  0.0386133151  -0.058             0.953584
## FunctionalMin2       -0.0254866644  0.0388651647  -0.656             0.512092
## FunctionalMod        -0.1105689899  0.0461520562  -2.396             0.016733
## FunctionalSev        -0.2767224374  0.1278759345  -2.164             0.030654
## FunctionalTyp         0.0220029869  0.0342871656   0.642             0.521168
## GarageArea            0.0001144000  0.0000334845   3.417             0.000655
## GarageCars            0.0196957828  0.0100996224   1.950             0.051382
## GarageCond_.L         0.0678343947  0.3505109246   0.194             0.846575
## GarageCond_.Q        -0.3867549131  0.3382736315  -1.143             0.253125
## GarageCond_.C         0.1296990494  0.2024959423   0.641             0.521964
## GarageCond_^4        -0.2087532238  0.1152065349  -1.812             0.070228
## GarageCond_^5         0.0770936152  0.0353222106   2.183             0.029252
## GarageQual_.L        -0.0306606142  0.3743794837  -0.082             0.934742
## GarageQual_.Q         0.4219833971  0.3306349084   1.276             0.202093
## GarageQual_.C        -0.0712230889  0.1594069675  -0.447             0.655097
## GarageQual_^4         0.1220914437  0.0848839862   1.438             0.150590
## GarageQual_^5                   NA            NA      NA                   NA
## GarageType_2Types    -0.1473163207  0.0485613253  -3.034             0.002466
## GarageType_Attchd     0.0034001395  0.0098296822   0.346             0.729472
## GarageType_Basment    0.0042347403  0.0299485052   0.141             0.887576
## GarageType_BuiltIn   -0.0084976154  0.0168305688  -0.505             0.613724
## GarageType_CarPort   -0.0109997689  0.0422427947  -0.260             0.794603
## GarageType_Detchd               NA            NA      NA                   NA
## GrLivArea             0.0002326115  0.0000156904  14.825 < 0.0000000000000002
## HalfBath              0.0310538257  0.0091381306   3.398             0.000700
## HeatingGasA           0.1530122251  0.1166363901   1.312             0.189804
## HeatingGasW           0.2081371498  0.1200491517   1.734             0.083206
## HeatingGrav          -0.0382546636  0.1263054197  -0.303             0.762036
## HeatingOthW           0.1563865734  0.1408509942   1.110             0.267084
## HeatingWall           0.1682731182  0.1323533944   1.271             0.203826
## HeatingQC_.L          0.0672561636  0.0752402239   0.894             0.371555
## HeatingQC_.Q         -0.0312863218  0.0636278020  -0.492             0.623012
## HeatingQC_.C          0.0356583328  0.0395634082   0.901             0.367605
## HeatingQC_^4         -0.0199051373  0.0175715946  -1.133             0.257515
## KitchenAbvGr         -0.0536142644  0.0248563834  -2.157             0.031199
## KitchenQual_.L        0.0275260255  0.0190288481   1.447             0.148276
## KitchenQual_.Q        0.0330279270  0.0133850279   2.468             0.013738
## KitchenQual_.C        0.0121970718  0.0073646491   1.656             0.097940
## LandSlopeMod          0.0214325640  0.0156802388   1.367             0.171918
## LandSlopeSev         -0.2149337013  0.0497921933  -4.317 0.000017096444735594
## LotArea               0.0000023320  0.0000004784   4.874 0.000001231314500616
## LotConfigCulDSac      0.0308504766  0.0141473336   2.181             0.029395
## LotConfigFR2         -0.0262207133  0.0175747281  -1.492             0.135963
## LotConfigFR3         -0.0708502987  0.0555036949  -1.276             0.202017
## LotConfigInside      -0.0138316620  0.0078668466  -1.758             0.078955
## LotFrontage_2         0.0004710405  0.0001943370   2.424             0.015499
## MasVnrType_BrkCmn    -0.0384680288  0.0293104678  -1.312             0.189615
## MasVnrType_BrkFace    0.0152144561  0.0080042494   1.901             0.057559
## MasVnrType_Stone      0.0175956814  0.0130390453   1.349             0.177434
## MSSubClass_30        -0.0629128298  0.0207457890  -3.033             0.002475
## MSSubClass_40        -0.0779182228  0.0643329787  -1.211             0.226059
## MSSubClass_45        -0.0084977794  0.0360908788  -0.235             0.813894
## MSSubClass_50         0.0064697005  0.0173381299   0.373             0.709101
## MSSubClass_60        -0.0142774631  0.0157975612  -0.904             0.366288
## MSSubClass_70         0.0171098742  0.0248556554   0.688             0.491348
## MSSubClass_75        -0.0124097477  0.0376824259  -0.329             0.741966
## MSSubClass_80        -0.0024194686  0.0175374099  -0.138             0.890294
## MSSubClass_85        -0.0083286644  0.0255520663  -0.326             0.744518
## MSSubClass_90        -0.0207801960  0.0282754640  -0.735             0.462526
## MSSubClass_120       -0.0337612584  0.0176946050  -1.908             0.056621
## MSSubClass_160       -0.1207651163  0.0266628197  -4.529 0.000006482500190032
## MSSubClass_180       -0.0491892341  0.0464786924  -1.058             0.290115
## MSSubClass_190       -0.0060458822  0.0290128639  -0.208             0.834961
## MSZoningFV            0.4780987817  0.0536430388   8.913 < 0.0000000000000002
## MSZoningRH            0.4391174056  0.0530905191   8.271 0.000000000000000337
## MSZoningRL            0.4373475809  0.0454806318   9.616 < 0.0000000000000002
## MSZoningRM            0.3814519058  0.0425375972   8.967 < 0.0000000000000002
## NeighborhoodBlueste   0.0445116275  0.0857232939   0.519             0.603680
## NeighborhoodBrDale   -0.0037230421  0.0495338732  -0.075             0.940098
## NeighborhoodBrkSide   0.0488352892  0.0409475155   1.193             0.233240
## NeighborhoodClearCr   0.0294752067  0.0391437679   0.753             0.451593
## NeighborhoodCollgCr  -0.0113941116  0.0311035473  -0.366             0.714182
## NeighborhoodCrawfor   0.1187842789  0.0370216306   3.209             0.001368
## NeighborhoodEdwards  -0.0694773296  0.0350207222  -1.984             0.047487
## NeighborhoodGilbert  -0.0111662987  0.0333522834  -0.335             0.737833
## NeighborhoodIDOTRR    0.0057583280  0.0469397919   0.123             0.902384
## NeighborhoodMeadowV  -0.1273509577  0.0529941105  -2.403             0.016401
## NeighborhoodMitchel  -0.0437587740  0.0352078258  -1.243             0.214149
## NeighborhoodNAmes    -0.0271832809  0.0336423591  -0.808             0.419240
## NeighborhoodNoRidge   0.0376324454  0.0355607962   1.058             0.290143
## NeighborhoodNPkVill   0.0387884483  0.0490504292   0.791             0.429218
## NeighborhoodNridgHt   0.0771237095  0.0317876734   2.426             0.015398
## NeighborhoodNWAmes   -0.0258621672  0.0346819281  -0.746             0.455992
## NeighborhoodOldTown  -0.0093279540  0.0413113648  -0.226             0.821397
## NeighborhoodSawyer   -0.0179160176  0.0351466632  -0.510             0.610317
## NeighborhoodSawyerW  -0.0016744072  0.0339786062  -0.049             0.960705
## NeighborhoodSomerst   0.0319077290  0.0393629818   0.811             0.417748
## NeighborhoodStoneBr   0.1148015258  0.0359860078   3.190             0.001457
## NeighborhoodSWISU     0.0069495298  0.0421297550   0.165             0.869006
## NeighborhoodTimber    0.0120139149  0.0352849246   0.340             0.733550
## NeighborhoodVeenker   0.0358974241  0.0457577146   0.785             0.432889
## OverallCond_.L       -0.1319417955  0.0951550507  -1.387             0.165812
## OverallCond_.Q        0.3824512932  0.0911940039   4.194 0.000029365398310457
## OverallCond_.C       -0.2787829500  0.0767179559  -3.634             0.000291
## OverallCond_^4        0.1358677986  0.0626837302   2.168             0.030384
## OverallCond_^5       -0.0090925339  0.0490347224  -0.185             0.852921
## OverallCond_^6       -0.0236868515  0.0351332901  -0.674             0.500309
## OverallCond_^7        0.0380427988  0.0225970994   1.684             0.092523
## OverallCond_^8       -0.0140791770  0.0126078844  -1.117             0.264339
## OverallQual_.L        0.6402649771  0.0759114784   8.434 < 0.0000000000000002
## OverallQual_.Q       -0.2393413576  0.0676010883  -3.540             0.000414
## OverallQual_.C        0.1946001421  0.0572585586   3.399             0.000699
## OverallQual_^4       -0.1399556146  0.0504434429  -2.775             0.005611
## OverallQual_^5        0.0287037317  0.0446591299   0.643             0.520518
## OverallQual_^6       -0.0313080904  0.0362565140  -0.864             0.388019
## OverallQual_^7       -0.0036557045  0.0263099231  -0.139             0.889514
## OverallQual_^8        0.0014004794  0.0165260612   0.085             0.932479
## OverallQual_^9       -0.0028880407  0.0087529552  -0.330             0.741493
## PoolArea              0.0013123476  0.0007527593   1.743             0.081513
## PoolQC_.L            -0.3991693831  0.2868619650  -1.392             0.164321
## PoolQC_.Q             0.3592532661  0.2691699291   1.335             0.182227
## PoolQC_.C            -0.2965629133  0.0995802153  -2.978             0.002956
## RoofMatlCompShg       2.6895334574  0.2017841282  13.329 < 0.0000000000000002
## RoofMatlMembran       3.0848381418  0.2504240198  12.318 < 0.0000000000000002
## RoofMatlMetal         2.9039006924  0.2501389062  11.609 < 0.0000000000000002
## RoofMatlRoll          2.7307246681  0.2304850908  11.848 < 0.0000000000000002
## RoofMatlTar&Grv       2.6904741834  0.2183561179  12.321 < 0.0000000000000002
## RoofMatlWdShake       2.6359194140  0.2134980999  12.346 < 0.0000000000000002
## RoofMatlWdShngl       2.7447125556  0.2055421217  13.354 < 0.0000000000000002
## RoofStyleGable       -0.0096175087  0.0802063945  -0.120             0.904574
## RoofStyleGambrel     -0.0159178401  0.0877672585  -0.181             0.856111
## RoofStyleHip         -0.0080760950  0.0804007340  -0.100             0.920005
## RoofStyleMansard      0.0346144904  0.0919610809   0.376             0.706681
## RoofStyleShed         0.4638968258  0.1634599236   2.838             0.004613
## SaleConditionAdjLand  0.1168684962  0.0628761687   1.859             0.063303
## SaleConditionAlloca   0.0554467172  0.0397875744   1.394             0.163696
## SaleConditionFamily   0.0258534029  0.0269496761   0.959             0.337582
## SaleConditionNormal   0.0653567677  0.0128309290   5.094 0.000000405159427511
## SaleConditionPartial  0.0007897260  0.0660326898   0.012             0.990460
## SaleTypeCon           0.0613609367  0.0778190582   0.789             0.430549
## SaleTypeConLD         0.1087649318  0.0434547416   2.503             0.012443
## SaleTypeConLI        -0.0146219111  0.0511111884  -0.286             0.774864
## SaleTypeConLw         0.0395145254  0.0539319975   0.733             0.463895
## SaleTypeCWD           0.0537718839  0.0570825462   0.942             0.346374
## SaleTypeNew           0.0968617403  0.0685646626   1.413             0.157991
## SaleTypeOth           0.0766193354  0.0644892139   1.188             0.235022
## SaleTypeWD           -0.0146968386  0.0185182055  -0.794             0.427554
## ScreenPorch           0.0002681291  0.0000549942   4.876 0.000001224333255908
## StreetPave            0.0884762183  0.0508661268   1.739             0.082212
## UtilitiesNoSeWa      -0.1689408194  0.1128014532  -1.498             0.134468
## WoodDeckSF            0.0000806656  0.0000258081   3.126             0.001815
## YearBuilt             0.0019809563  0.0003460027   5.725 0.000000012922442925
## YearRemodAdd          0.0008693564  0.0002396610   3.627             0.000298
## FenceDummy_          -0.0141490258  0.0081421025  -1.738             0.082499
## LowQualFinSF         -0.0001102286  0.0000715018  -1.542             0.123419
## ExterCond_.L          0.1635372825  0.1005723732   1.626             0.104188
## ExterCond_.Q         -0.0672374281  0.0828442627  -0.812             0.417169
## ExterCond_.C          0.0706441893  0.0513250663   1.376             0.168942
## ExterCond_^4          0.0057785861  0.0216336964   0.267             0.789428
##                         
## (Intercept)          .  
## `3SsnPorch`          .  
## BsmtExposure_.L         
## BsmtExposure_.Q         
## BsmtExposure_.C         
## BsmtExposure_^4         
## BsmtFinSF1           ***
## BsmtFinSF2           ***
## BsmtFullBath         ** 
## BsmtUnfSF            ***
## CentralAir_Yes       ***
## Condition1Feedr      .  
## Condition1Norm       ***
## Condition1PosA          
## Condition1PosN       ** 
## Condition1RRAe          
## Condition1RRAn       .  
## Condition1RRNe          
## Condition1RRNn          
## Condition2Feedr         
## Condition2Norm          
## Condition2PosA       .  
## Condition2PosN       ***
## Condition2RRAe       *  
## Condition2RRAn          
## Condition2RRNn          
## ElectricalFuseF         
## ElectricalFuseP         
## ElectricalMix           
## ElectricalSBrkr      *  
## EnclosedPorch        *  
## Exterior1stAsphShn      
## Exterior1stBrkComm   .  
## Exterior1stBrkFace   *  
## Exterior1stCBlock       
## Exterior1stCemntBd      
## Exterior1stHdBoard      
## Exterior1stImStucc      
## Exterior1stMetalSd      
## Exterior1stPlywood      
## Exterior1stStone        
## Exterior1stStucco       
## Exterior1stVinylSd      
## Exterior1stWd Sdng      
## Exterior1stWdShing      
## Fireplaces           ***
## FoundationCBlock        
## FoundationPConc      .  
## FoundationSlab          
## FoundationStone      .  
## FoundationWood       *  
## FullBath             ** 
## FunctionalMaj2       ***
## FunctionalMin1          
## FunctionalMin2          
## FunctionalMod        *  
## FunctionalSev        *  
## FunctionalTyp           
## GarageArea           ***
## GarageCars           .  
## GarageCond_.L           
## GarageCond_.Q           
## GarageCond_.C           
## GarageCond_^4        .  
## GarageCond_^5        *  
## GarageQual_.L           
## GarageQual_.Q           
## GarageQual_.C           
## GarageQual_^4           
## GarageQual_^5           
## GarageType_2Types    ** 
## GarageType_Attchd       
## GarageType_Basment      
## GarageType_BuiltIn      
## GarageType_CarPort      
## GarageType_Detchd       
## GrLivArea            ***
## HalfBath             ***
## HeatingGasA             
## HeatingGasW          .  
## HeatingGrav             
## HeatingOthW             
## HeatingWall             
## HeatingQC_.L            
## HeatingQC_.Q            
## HeatingQC_.C            
## HeatingQC_^4            
## KitchenAbvGr         *  
## KitchenQual_.L          
## KitchenQual_.Q       *  
## KitchenQual_.C       .  
## LandSlopeMod            
## LandSlopeSev         ***
## LotArea              ***
## LotConfigCulDSac     *  
## LotConfigFR2            
## LotConfigFR3            
## LotConfigInside      .  
## LotFrontage_2        *  
## MasVnrType_BrkCmn       
## MasVnrType_BrkFace   .  
## MasVnrType_Stone        
## MSSubClass_30        ** 
## MSSubClass_40           
## MSSubClass_45           
## MSSubClass_50           
## MSSubClass_60           
## MSSubClass_70           
## MSSubClass_75           
## MSSubClass_80           
## MSSubClass_85           
## MSSubClass_90           
## MSSubClass_120       .  
## MSSubClass_160       ***
## MSSubClass_180          
## MSSubClass_190          
## MSZoningFV           ***
## MSZoningRH           ***
## MSZoningRL           ***
## MSZoningRM           ***
## NeighborhoodBlueste     
## NeighborhoodBrDale      
## NeighborhoodBrkSide     
## NeighborhoodClearCr     
## NeighborhoodCollgCr     
## NeighborhoodCrawfor  ** 
## NeighborhoodEdwards  *  
## NeighborhoodGilbert     
## NeighborhoodIDOTRR      
## NeighborhoodMeadowV  *  
## NeighborhoodMitchel     
## NeighborhoodNAmes       
## NeighborhoodNoRidge     
## NeighborhoodNPkVill     
## NeighborhoodNridgHt  *  
## NeighborhoodNWAmes      
## NeighborhoodOldTown     
## NeighborhoodSawyer      
## NeighborhoodSawyerW     
## NeighborhoodSomerst     
## NeighborhoodStoneBr  ** 
## NeighborhoodSWISU       
## NeighborhoodTimber      
## NeighborhoodVeenker     
## OverallCond_.L          
## OverallCond_.Q       ***
## OverallCond_.C       ***
## OverallCond_^4       *  
## OverallCond_^5          
## OverallCond_^6          
## OverallCond_^7       .  
## OverallCond_^8          
## OverallQual_.L       ***
## OverallQual_.Q       ***
## OverallQual_.C       ***
## OverallQual_^4       ** 
## OverallQual_^5          
## OverallQual_^6          
## OverallQual_^7          
## OverallQual_^8          
## OverallQual_^9          
## PoolArea             .  
## PoolQC_.L               
## PoolQC_.Q               
## PoolQC_.C            ** 
## RoofMatlCompShg      ***
## RoofMatlMembran      ***
## RoofMatlMetal        ***
## RoofMatlRoll         ***
## RoofMatlTar&Grv      ***
## RoofMatlWdShake      ***
## RoofMatlWdShngl      ***
## RoofStyleGable          
## RoofStyleGambrel        
## RoofStyleHip            
## RoofStyleMansard        
## RoofStyleShed        ** 
## SaleConditionAdjLand .  
## SaleConditionAlloca     
## SaleConditionFamily     
## SaleConditionNormal  ***
## SaleConditionPartial    
## SaleTypeCon             
## SaleTypeConLD        *  
## SaleTypeConLI           
## SaleTypeConLw           
## SaleTypeCWD             
## SaleTypeNew             
## SaleTypeOth             
## SaleTypeWD              
## ScreenPorch          ***
## StreetPave           .  
## UtilitiesNoSeWa         
## WoodDeckSF           ** 
## YearBuilt            ***
## YearRemodAdd         ***
## FenceDummy_          .  
## LowQualFinSF            
## ExterCond_.L            
## ExterCond_.Q            
## ExterCond_.C            
## ExterCond_^4            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1017 on 1249 degrees of freedom
## Multiple R-squared:  0.9441, Adjusted R-squared:  0.9352 
## F-statistic:   106 on 199 and 1249 DF,  p-value: < 0.00000000000000022

O modelo Stepwise fornece um R² de 93.5 %, o seja, atinge um número similar de ajustamento com menos variáveis.

autoplot(houses_model1_step)

Modelo 3

##Model 3
houses_model3 <- lm(SalePrice_log ~ OverallQual_ + GrLivArea + GarageArea + TotalBsmtSF + YearBuilt , 
                     data = ds2_all_train)
summary(houses_model3)
## 
## Call:
## lm(formula = SalePrice_log ~ OverallQual_ + GrLivArea + GarageArea + 
##     TotalBsmtSF + YearBuilt, data = ds2_all_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.21210 -0.07304  0.00956  0.08947  0.61658 
## 
## Coefficients:
##                   Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)     6.13206662  0.39875118  15.378 < 0.0000000000000002 ***
## OverallQual_.L  1.08854209  0.08410090  12.943 < 0.0000000000000002 ***
## OverallQual_.Q -0.21963517  0.07058379  -3.112               0.0019 ** 
## OverallQual_.C -0.09582500  0.06330631  -1.514               0.1303    
## OverallQual_^4 -0.02059410  0.06251442  -0.329               0.7419    
## OverallQual_^5 -0.19411898  0.05934152  -3.271               0.0011 ** 
## OverallQual_^6  0.00787750  0.04984673   0.158               0.8745    
## OverallQual_^7 -0.05997238  0.03632255  -1.651               0.0989 .  
## OverallQual_^8  0.01007919  0.02311834   0.436               0.6629    
## OverallQual_^9 -0.01790371  0.01285412  -1.393               0.1639    
## GrLivArea       0.00025711  0.00001180  21.785 < 0.0000000000000002 ***
## GarageArea      0.00025470  0.00002816   9.045 < 0.0000000000000002 ***
## TotalBsmtSF     0.00011249  0.00001355   8.304 0.000000000000000229 ***
## YearBuilt       0.00260631  0.00020270  12.858 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.173 on 1435 degrees of freedom
## Multiple R-squared:  0.8142, Adjusted R-squared:  0.8125 
## F-statistic: 483.6 on 13 and 1435 DF,  p-value: < 0.00000000000000022
autoplot(houses_model3)

Modelo 4

##Model 4
houses_model4 <- lm(SalePrice_log ~ GrLivArea + GarageArea + TotalBsmtSF + YearBuilt , 
                     data = ds2_all_train)
summary(houses_model4)
## 
## Call:
## lm(formula = SalePrice_log ~ GrLivArea + GarageArea + TotalBsmtSF + 
##     YearBuilt, data = ds2_all_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.82184 -0.07944  0.00853  0.09530  0.66933 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2.42626133 0.39315138   6.171       0.000000000878 ***
## GrLivArea   0.00035472 0.00001186  29.913 < 0.0000000000000002 ***
## GarageArea  0.00034221 0.00003172  10.788 < 0.0000000000000002 ***
## TotalBsmtSF 0.00016446 0.00001482  11.099 < 0.0000000000000002 ***
## YearBuilt   0.00442584 0.00020298  21.804 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.199 on 1444 degrees of freedom
## Multiple R-squared:  0.7525, Adjusted R-squared:  0.7518 
## F-statistic:  1097 on 4 and 1444 DF,  p-value: < 0.00000000000000022
autoplot(houses_model4)

Modelo 5

##Model 5
houses_model5 <- lm(SalePrice_log ~ GrLivArea + GarageArea + TotalBsmtSF + YearBuilt + SaleCondition, 
                     data = ds2_all_train)
summary(houses_model5)
## 
## Call:
## lm(formula = SalePrice_log ~ GrLivArea + GarageArea + TotalBsmtSF + 
##     YearBuilt + SaleCondition, data = ds2_all_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.84368 -0.08098  0.00775  0.09809  0.65373 
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          2.71925368 0.39685379   6.852      0.0000000000108 ***
## GrLivArea            0.00035463 0.00001167  30.381 < 0.0000000000000002 ***
## GarageArea           0.00032355 0.00003149  10.276 < 0.0000000000000002 ***
## TotalBsmtSF          0.00016397 0.00001470  11.151 < 0.0000000000000002 ***
## YearBuilt            0.00421979 0.00020497  20.588 < 0.0000000000000002 ***
## SaleConditionAdjLand 0.09175959 0.10012749   0.916                0.360    
## SaleConditionAlloca  0.01625644 0.06242337   0.260                0.795    
## SaleConditionFamily  0.01669116 0.04790987   0.348                0.728    
## SaleConditionNormal  0.13074958 0.02038860   6.413      0.0000000001932 ***
## SaleConditionPartial 0.17255486 0.02795272   6.173      0.0000000008692 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1956 on 1439 degrees of freedom
## Multiple R-squared:  0.7617, Adjusted R-squared:  0.7602 
## F-statistic: 511.1 on 9 and 1439 DF,  p-value: < 0.00000000000000022
autoplot(houses_model5)

Modelo 6

##Model 6
houses_model6 <- lm(SalePrice_log ~ GrLivArea + GarageArea + TotalBsmtSF + YearBuilt + MSSubClass_, 
                     data = ds2_all_train)
summary(houses_model6)
## 
## Call:
## lm(formula = SalePrice_log ~ GrLivArea + GarageArea + TotalBsmtSF + 
##     YearBuilt + MSSubClass_, data = ds2_all_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62169 -0.08364  0.00393  0.09431  0.65494 
## 
## Coefficients:
##                   Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)     0.36389353  0.61597915   0.591             0.554777    
## GrLivArea       0.00035922  0.00001718  20.915 < 0.0000000000000002 ***
## GarageArea      0.00030443  0.00003040  10.014 < 0.0000000000000002 ***
## TotalBsmtSF     0.00012486  0.00001873   6.668  0.00000000003690232 ***
## YearBuilt       0.00550775  0.00031502  17.484 < 0.0000000000000002 ***
## MSSubClass_30  -0.05177832  0.02813617  -1.840             0.065935 .  
## MSSubClass_40   0.07563853  0.09469935   0.799             0.424584    
## MSSubClass_45   0.06002185  0.05624106   1.067             0.286051    
## MSSubClass_50   0.01802657  0.02268266   0.795             0.426903    
## MSSubClass_60  -0.04013343  0.02088366  -1.922             0.054834 .  
## MSSubClass_70   0.17070546  0.03289483   5.189  0.00000024128523375 ***
## MSSubClass_75   0.09443019  0.05450907   1.732             0.083422 .  
## MSSubClass_80   0.02283255  0.02690850   0.849             0.396287    
## MSSubClass_85   0.01245897  0.04309661   0.289             0.772551    
## MSSubClass_90  -0.23023837  0.02857599  -8.057  0.00000000000000163 ***
## MSSubClass_120  0.01023796  0.02278330   0.449             0.653238    
## MSSubClass_160 -0.20077215  0.02784303  -7.211  0.00000000000089784 ***
## MSSubClass_180 -0.20876303  0.06083173  -3.432             0.000617 ***
## MSSubClass_190 -0.02665014  0.03880041  -0.687             0.492287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1878 on 1430 degrees of freedom
## Multiple R-squared:  0.7819, Adjusted R-squared:  0.7791 
## F-statistic: 284.8 on 18 and 1430 DF,  p-value: < 0.00000000000000022
autoplot(houses_model6)

Cross-Validation

Agora que já temos uma idéa de quais variáveis deverão ser incluídas no modelo, vamos fazer o cross validation.

# Model 1: all variables
set.seed(123)
cv_model1 <- train(
  SalePrice_log ~ ., 
  data = ds2_all_train[,-c(1,2)], 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)

# Model 2: Stepwise selected
set.seed(123)
cv_model2 <- train(
  SalePrice_log ~  `3SsnPorch` + BsmtExposure_ + BsmtFinSF1 + BsmtFinSF2 + 
    BsmtFullBath + BsmtUnfSF + CentralAir_ + Condition1 + Condition2 + 
    Electrical + EnclosedPorch + Exterior1st + Fireplaces + Foundation + 
    FullBath + Functional + GarageArea + GarageCars + GarageCond_ + 
    GarageQual_ + GarageType_ + GrLivArea + HalfBath + Heating + 
    HeatingQC_ + KitchenAbvGr + KitchenQual_ + LandSlope + LotArea + 
    LotConfig + LotFrontage_2 + MasVnrType_ + MSSubClass_ + MSZoning + 
    Neighborhood + OverallCond_ + OverallQual_ + PoolArea + PoolQC_ + 
    RoofMatl + RoofStyle + SaleCondition + SaleType + ScreenPorch + 
    Street + Utilities + WoodDeckSF + YearBuilt + YearRemodAdd + 
    FenceDummy_ + LowQualFinSF + ExterCond_ , 
  data = ds2_all_train, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)

# Model 3: 
set.seed(123)
cv_model3 <- train(
  SalePrice_log ~ OverallQual_ + GrLivArea + GarageArea + TotalBsmtSF + YearBuilt, 
  data = ds2_all_train, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)

# Model 4: 
set.seed(123)
cv_model4 <- train(
  SalePrice_log ~ GrLivArea + GarageArea + TotalBsmtSF + YearBuilt, 
  data = ds2_all_train, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)

# Model 5: 
set.seed(123)
cv_model5 <- train(
  SalePrice_log ~ GrLivArea + GarageArea + TotalBsmtSF + YearBuilt + SaleCondition, 
  data = ds2_all_train, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)

# Model 6: 
set.seed(123)
cv_model6 <- train(
  SalePrice_log ~ GrLivArea + GarageArea + TotalBsmtSF + YearBuilt + MSSubClass_, 
  data = ds2_all_train, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
# extract out of sample performance measures
rbind(
      modelo_1 = cv_model1$results, 
      modelo_stepwise = cv_model2$results, 
      modelo_3 = cv_model3$results,
      modelo_4 = cv_model4$results,
      modelo_5 = cv_model5$results,
      modelo_6 = cv_model6$results
)  %>% 
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
modelo_1 TRUE 0.1984153 0.7792703 0.0989773 0.0827491 0.1320839 0.0125427
modelo_stepwise TRUE 0.1942762 0.7846206 0.0934100 0.0894963 0.1439271 0.0143781
modelo_3 TRUE 0.1751786 0.8094793 0.1178691 0.0358884 0.0747034 0.0109550
modelo_4 TRUE 0.1989073 0.7581153 0.1313164 0.0453922 0.1012749 0.0110731
modelo_5 TRUE 0.1962135 0.7645059 0.1299365 0.0448691 0.0995772 0.0098148
modelo_6 TRUE 0.1892963 0.7808777 0.1251320 0.0423780 0.0913277 0.0098892

Note que com o cross-validation, o R² dos modelos 1 e 2 cairam de 93.5 % e 93.5 % respectivamente, para 77.9 % e 78.5 .

Avaliação [Evaluation]

Usando o quadro acima, decidimos por usar o modelo 6 para prever a base de teste.

O modelo 6 nos dá um ajuste de 78.1 %. Analisando a tablea abaixo temos que

summary(cv_model6)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62169 -0.08364  0.00393  0.09431  0.65494 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)     0.36389353  0.61597915   0.591             0.554777    
## GrLivArea       0.00035922  0.00001718  20.915 < 0.0000000000000002 ***
## GarageArea      0.00030443  0.00003040  10.014 < 0.0000000000000002 ***
## TotalBsmtSF     0.00012486  0.00001873   6.668  0.00000000003690232 ***
## YearBuilt       0.00550775  0.00031502  17.484 < 0.0000000000000002 ***
## MSSubClass_30  -0.05177832  0.02813617  -1.840             0.065935 .  
## MSSubClass_40   0.07563853  0.09469935   0.799             0.424584    
## MSSubClass_45   0.06002185  0.05624106   1.067             0.286051    
## MSSubClass_50   0.01802657  0.02268266   0.795             0.426903    
## MSSubClass_60  -0.04013343  0.02088366  -1.922             0.054834 .  
## MSSubClass_70   0.17070546  0.03289483   5.189  0.00000024128523375 ***
## MSSubClass_75   0.09443019  0.05450907   1.732             0.083422 .  
## MSSubClass_80   0.02283255  0.02690850   0.849             0.396287    
## MSSubClass_85   0.01245897  0.04309661   0.289             0.772551    
## MSSubClass_90  -0.23023837  0.02857599  -8.057  0.00000000000000163 ***
## MSSubClass_120  0.01023796  0.02278330   0.449             0.653238    
## MSSubClass_150          NA          NA      NA                   NA    
## MSSubClass_160 -0.20077215  0.02784303  -7.211  0.00000000000089784 ***
## MSSubClass_180 -0.20876303  0.06083173  -3.432             0.000617 ***
## MSSubClass_190 -0.02665014  0.03880041  -0.687             0.492287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1878 on 1430 degrees of freedom
## Multiple R-squared:  0.7819, Adjusted R-squared:  0.7791 
## F-statistic: 284.8 on 18 and 1430 DF,  p-value: < 0.00000000000000022

Já que calculamos o modelo usando o log da variável resposta, para interpretá-lo, temos que fazer a exponencial das variáveis.

exp(summary(cv_model6)$coefficients[,1])
##    (Intercept)      GrLivArea     GarageArea    TotalBsmtSF      YearBuilt 
##      1.4389210      1.0003593      1.0003045      1.0001249      1.0055229 
##  MSSubClass_30  MSSubClass_40  MSSubClass_45  MSSubClass_50  MSSubClass_60 
##      0.9495393      1.0785726      1.0618597      1.0181900      0.9606612 
##  MSSubClass_70  MSSubClass_75  MSSubClass_80  MSSubClass_85  MSSubClass_90 
##      1.1861413      1.0990324      1.0230952      1.0125369      0.7943442 
## MSSubClass_120 MSSubClass_160 MSSubClass_180 MSSubClass_190 
##      1.0102905      0.8180988      0.8115875      0.9737018

Ou seja, o preço de venda das casas no modelo 3 é emplicado pelas variáveis:

  • MSSubClass_
  • GrLivArea
  • GarageArea
  • TotalBsmtSF
  • YearBuilt

Interpretando algumas das variáveis, temos que para cada 1 unidade de mudança na variável Área Total do porão (TotalBsmtSF) o preço de venda da casa aumenta em 1.0001249 . Para o caso da variável que avalia a o tipo de residência (MSSubClass_), como essa é uma variável qualitativa notamos que o fator 20 (casas de 1 andar contruídas depois de 1945) está incluido no intercepto, mas para o caso das casas de 1 andar contruidas até 1945 (30), temos que o preço de venda sa casa aumenta em 1.0785726 .

Gráfico da Variável Predita vs Atual

ds2_all_train$SalePrice_log_pred <- predict(cv_model6, ds2_all_train)
ds2_all_train$SalePrice_pred <- exp(ds2_all_train$SalePrice_log_pred)
ggplot(data=ds2_all_train, aes(x=c(1:nrow(ds2_all_train)))) + 
  geom_point(aes(y = SalePrice, color = "Actual"), alpha = 0.5) +
  geom_point(aes(y = SalePrice_pred, color = "Predicted"), alpha = 0.5) +
  #labs(color='Transported') +
  ggtitle("Gráfico Variável SalePredict") +
  ylab("SalePrice") +
  xlab("ID") 

Aplicação [Deployment]

Vamos então para a parte mais esperada, fazer a predição da base Test. Utilizo o Modelo 6 para fazer a predição:

ds2_all_test$SalePrice_log_pred <- predict(cv_model6, ds2_all_test)
ds2_all_test$SalePrice_pred <- exp(ds2_all_test$SalePrice_log_pred)

Faço o gráfico mostrando a variável predita com a variável resposta que neste caso veio do arquivo sample_submission:

ggplot(data=ds2_all_test, aes(x=c(1:nrow(ds2_all_test)))) + 
  geom_point(aes(y = SalePrice, color = "Actual"), alpha = 0.5) +
  geom_point(aes(y = SalePrice_pred, color = "Predicted"), alpha = 0.5) +
  #labs(color='Transported') +
  ggtitle("Gráfico Variável SalePredict (base test)") +
  ylab("SalePrice") +
  xlab("ID") 

Aparentemente, tivemos alsuns valores bem fora da dos que seriam os corretos. Vamos verificar a acurácia desse modelo comparando com as respostas que estavam no aquivo sample_submission, lembrando que isso não acontece no machine learning, na realidade não temos a resposta, mas vale fazer essa verificação aqui.

MSE <- ds2_all_test %>%
  mutate(error = SalePrice_pred - SalePrice,
         sq.error = error^2) %>%
  summarise(MSE = mean(sq.error))

RMSE <- sqrt(MSE)

table1 <- cbind(MSE, RMSE) 
colnames(table1) <- c("MSE", "RMSE")
table2 <- cbind(cv_model3$results$MAE, cv_model3$results$RMSE)
colnames(table2) <- c("MSE", "RMSE")
table3 <- rbind(table1,table2)
row.names(table3) <- c("test", "train")

round(table3,2) %>% 
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
MSE RMSE
test 4830504118.29 69501.83
train 0.12 0.18

Fazendo essa comparação, notamos que o erro da base de teste é muito grande, mostrando que esse modelo não tem um bom ajuste nessa base. Isso pode ser um indicativo que a base de teste era muito diferente da base de treino.

Por exemplo, se pegarmos o summary da variável resposta de ambas as bases:

summary(ds2_all_train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129900  162000  180545  214000  755000
summary(ds2_all_test$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  135751  168591  179188  178952  186599  281644

Já podemos ver que a base de treino tem uma variabilidade maior, indo de um valor pequeno até um muito grande, já a base de teste possui uma variabilidade menor, indicando assim que outliers no modelo de treino deveriam ter sido melhor tratado, pois podem estar causando interferencia no modelo preditivo final.

Para se obter um melhor ajuste seria necessário voltar alguns passos e rever toda essa modelagem. Contudo, na vida real, no machine leaning, não temos a resposta.

Exportando o arquivo final

test_keep <- c("Id", "SalePrice_pred")
ds2_test_final_submission <- ds2_all_test[,test_keep]

head(ds2_test_final_submission) %>%  
  kbl %>% kable_paper(c("striped", "hover"), full_width = F, fixed_thead = T)
Id SalePrice_pred
1461 135783.6
1462 145277.1
1463 193013.7
1464 191595.4
1465 183365.9
1466 184321.6
write.csv(ds2_test_final_submission,".\\final_submission_house.csv", row.names = FALSE)